The goal set for the blog post was a tool that gives basic understanding of how the grades in my course behave. That means I want to see some grade distributions for the final and all partial grades, some basic statistics and so forth. I want to see a comparison between tutorial groups to see if there are grade differences between groups (e.g. because of the way tutor Anne and tutor Bob teach). Depending on the course or programme, I may also want to pay attention to attributes that are not based on tutorial groups: perhaps I want to compare scholarship recipients with self-paying students, or recidivists with students taking the course for the first time.
In this supplementary documentation, we start with the final product and I then walk you through all steps of its creation.
The final product
I have prepared a short video that shows what the tool does for those who would like the tool and don’t need to know how I got it:
If you want to use the tool, you can download it here and use it on your own computer. It’s a simple HTML file, so you do not need to install anything. Clicking on it will open it in the browser of your choice.
You can also use it on my website. Your data is still safe even if you access the tool on my website: all processing happens inside your browser. Nothing is sent to my server and in fact, my server has no mechanism to receive data.
Fake example data
As the tool needs to be able to analyse Canvas grade exports, I start by collecting a couple of grade exports from Canvas courses I have taught in the past.
We want to feed the LLM realistic data that is not real data. I replace all identifiers with fakes in each export. This starts with the course code in the file name. On the student side, I replace all names with fake names from a random name generator. The Canvas-internal student ID is replaced with a random number with a comparable minimum and maximum, and the UM student ID is replaced with a number that has the same structure as a UM student ID but cannot possibly be a real student’s number (like I7542558). On the course side, the names and internal ID numbers of all grade series are replaced with something bland and innocuous like paper, presentation, exam.
If you are concerned about unique patterns in your grade file, you can swap group numbers around or replace grades: students in group 1 are now in group 3, all 6s are now 7s, and so forth.
When I’m done, no identifier remains that can be traced back to a course, coordinator, tutor or student. And yet: this could be a Canvas grade export for a real course.
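The replacement steps described above can be sketched in a few lines. This is a minimal illustration in Python, not the procedure actually used for the example files: the row keys `name`, `canvas_id` and `student_id` are hypothetical, and the `I7` prefix is simply copied from the example ID in the text. Which ID ranges are guaranteed to be unused is something only the institution knows, so the sketch cannot promise that by itself.

```python
import random

def fake_student_id(rng):
    """Return an ID shaped like the example in the text: 'I' plus seven digits.

    Assumption: the 'I7' prefix is copied from the example I7542558. Whether
    that range is really unused must be checked against the real ID scheme.
    """
    return "I7" + "".join(rng.choice("0123456789") for _ in range(6))

def pseudonymise(rows, fake_names, seed=0):
    """Replace name, Canvas ID and student ID in every row with fakes.

    rows: list of dicts with the hypothetical keys 'name', 'canvas_id'
          and 'student_id'; all other keys (the grades) are left untouched.
    fake_names: replacement names, at least as many as there are rows.
    """
    rng = random.Random(seed)                      # seeded, so runs are repeatable
    names = rng.sample(fake_names, len(rows))      # no fake name is used twice
    old_ids = [row["canvas_id"] for row in rows]
    # New Canvas IDs are drawn from the same min..max range as the originals.
    new_ids = rng.sample(range(min(old_ids), max(old_ids) + 1), len(rows))
    used = set()
    result = []
    for row, name, canvas_id in zip(rows, names, new_ids):
        student_id = fake_student_id(rng)
        while student_id in used:                  # keep student IDs unique
            student_id = fake_student_id(rng)
        used.add(student_id)
        result.append({**row, "name": name,
                       "canvas_id": canvas_id, "student_id": student_id})
    return result
```

Grade columns and group membership pass through unchanged here; swapping group numbers or shifting grades, as suggested above, would be a separate step.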
Here are the fake grade files I have used.
Initial prompt
I write a detailed prompt that explains everything I want, explicitly instructs Claude to get additional documentation for the technical parts (like the Canvas grade export file format) and invites suggestions.
As part of this prompt, I require that Claude check the example files I have provided, ask clarification questions and finally prepare a spec sheet.
The purpose of this project is to create a single-file HTML page that allows me to analyze grades exported from the Canvas LCMS. I or other users need to be able to drop a Canvas export CSV onto the page, have the page clarify what kind of comparisons should be run and have it then prepare basic graphs and descriptive statistics that can be viewed or printed.
I would like to provide a couple of starting points first:
USER INTERFACE
The output needs to be a single-file HTML that contains all necessary resources. It should not load any additional resources, or send any of the data to a server for processing. The HTML page should run on an air-gapped computer. Do not use React or other tools that require a build process.
The user interface should use the included mvp.css style sheet and should respect light mode/dark mode preferences of the user (no need to provide a manual toggle). Do not spend time on the look and feel.
DATA IMPORT
I have provided two example exports from Canvas, you can find them in the project directory under the names 2026-06-01T1445_Grades-TST5106.csv and 2026-06-01T1447_Grades-TST5011.csv.
Basic information from my side:
- This tool is for use at Maastricht University. We use Dutch grades from 1 to 10. Different parts of the university use decimal grades (7.0, 7.1, 7.2), half-mark steps (7.0, 7.5, 8.0) or full grades (7, 8, 9). The standard for passing is commonly 5.5 or 6.0.
- The filename contains the date and time of the export as well as the course code. All should be displayed when presenting the analysis.
- The first three rows are header information. The first row is the actual header, the second indicates how the data for each variable is entered, and the third row indicates the maximum number of points.
- The first five columns are header columns. Columns 2 and 4 can be ignored. Column 1 contains the name in the format “family name, given name” and column 3 contains the student ID number. When listing students in later steps, it is handy to provide ways of sorting alphabetically or by ID number. Column 5 identifies whether a student belongs to specific groups, most commonly in the format “Tutorial Section NN”, placing the student in tutorial group NN. Given that Maastricht University uses PBL, we see tutorial groups in almost all exports, and comparison between groups is one way of analyzing this data. Any other “sections” should be treated as boolean attributes.
- For full course exports, we usually find a test student in the final row. This student should be ignored. The absence of this student usually indicates a partial export.
- After the manually entered grades, Canvas automatically generates columns that translate the grades to a percentage from 0 to 100. These are usually ignored, as Canvas simply multiplies the Dutch grade by 10 and ignores the fact that grades 0.0 to 0.9 do not exist in the Dutch system.
Please do the following:
- check online for documentation on the Canvas Grade Center export file format;
- review the two example files;
- ask me any clarification questions.
DATA ENTRY: COURSE CHARACTERISTICS
Next, the tool should collect information on the structure of the grades contained in the file.
COURSE ATTRIBUTES
The tool should prepopulate the course code and export date/time, and allow the user to enter a course title. The user should also be asked to indicate the minimum grade for a passing mark (default: 5.5).
GRADE ATTRIBUTES
Most exports have multiple partial grades, a final grade, and potentially repetitions of the same for sit and resit. The tool needs to extract the individual grade variables and ask the user to classify them:
- Is this a partial or final grade?
- If partial, what percentage of the final grade does this yield?
- For either: first sit or resit?
It seems efficient to me that the user be provided an interface where they can create the partial assignments with percentages and assign the relevant columns to the sit and resit. The table for TST5106 could look like this:

| Grade | Share | Sit | Resit |
|---|---|---|---|
| Final Grade | 100 | Final Course Grade (147133) | Final Course RESIT Grade (155790) |
| Proposal | 50 | Proposal (236223) | Proposal RESIT (136224) |
| Paper | 50 | Paper (236222) | Paper RESIT (136237) |

The grade and share should be text boxes, with the share being constrained to a number. For each row, sit and resit should be drop-down boxes that list all the available grade variables in the uploaded file. The user can then pick the ones they want. The tool should be flexible enough to deal with missing grades, e.g. if a course has only had the first sit and no resit grades are in the file yet. The purpose here is to simply understand which variable belongs where.
I recommend prepopulating the list with the final grade (always in the top row, always 100%, cannot be removed), and two assignments (partial grade #1, partial grade #2, defaulting to 50%). The tool should offer the option to add or remove rows (other than for the final grade). It is possible to have only a final grade with no partial grades.
The tool should perform a basic sanity check to ensure that the shares add up to 100%. If they don’t, the tool can display an error message, but should not force the user to fix the issue.
If resit grades are available, allow the user to choose how they want the resit grades to be used, with the options being:
- Use only the information from the first sit
- Use only the information from the first attempt (first sit plus resit grade for students who do not have a grade for the first sit, e.g. because they were ill)
- Best grade (for students who have a sit and a resit grade, we use the better grade)
For each of these choices, the tool drops excess grades so that it has only one grade for each combination of assignment and student:
- For option 1, the tool ignores all resit grades.
- For option 2, the tool backfills the sit grades with resit grades when there is a gap.
- For option 3, the tool picks the highest grade across sit and resit.
STUDENT ATTRIBUTES
The most important distinguishing feature among students is the tutorial group they belong to. The group association was extracted from column 5, so the tool already knows which tutorial groups exist and who the members are. The tool should allow the user to specify one or more characteristics per group. Most frequently, this will be the tutor who taught that group. Here, too, a table should be provided, giving the user the option to add one or more characteristics. For TST5106, it could look like this:

| Tutorial Group | Tutor | Time slot | Specialization |
|---|---|---|---|
| 1 | Anne | 08:30h | Econ |
| 2 | Anne | 11:00h | Econ |
| 3 | Anne | 13:30h | Business |
| 4 | Bob | 08:30h | Business |
| 5 | Chuck | 08:30h | Accounting |
| 6 | Bob | 11:00h | Finance |
| 7 | Chuck | 13:30h | Tax Accounting |
| 8 | Daisy | 11:00h | Business |
| 9 | Chuck | 16:00h | Business |

The user interface should start with rows for all tutorial groups and one column prepopulated with the name “Tutor”. The user can add and remove columns. It’s possible that the user does not want to add any attributes, removing all columns except Tutorial Group.
In addition, the tool should optionally allow the creation of student subdivisions based on the user’s needs. The user can add a name and the categories in the subdivision, and is then given a table of the students with radio buttons to place them into one of the categories. This could be for comparisons between low/high tuition-fee students, male/female, scholarship recipients vs. self-paying students, etc.
For TST5106, this could look like this:
Student categorization: Scholarship status
Categories:
- self-paying
- scholarship
| Student | ID number | self-paying | scholarship |
|---|---|---|---|
| Faulkner, Jarred | I7286285 | X | 0 |
| Sexton, Shelia | I7065747 | X | 0 |
| Mendoza, Mel | I7034128 | X | 0 |
| Arnold, Noel | I7151683 | X | 0 |
| Barnett, Eloise | I7494023 | 0 | X |
| Mccarty, Dylan | I7667677 | X | 0 |
| Atkins, Wilton | I7794938 | X | 0 |
| Bradshaw, Danilo | I7755145 | X | 0 |
| Powell, Gerry | I7453620 | 0 | X |
| Flowers, Lorie | I7700697 | X | 0 |
| Walters, Angel | I7669668 | X | 0 |

Feel free to propose other ways of breaking down the data once you have reviewed the CSV files.
ANALYSIS
Once the user has made their inputs, the tool prepares a series of analyses. The purpose is to get a basic feeling for how the grades behave.
While Dutch grades are in principle numerical, I would prefer that our display and analysis treat them as ordinal whenever possible. This means we lean towards histograms when showing the distribution, we prefer the mode over the mean, and in comparisons, we use non-parametric tests. Scatterplots and boxplots are also acceptable, though scatterplots likely need some jitter applied to them so the individual data points remain visible.
COURSE-LEVEL
As a base-line, I would like the following analyses:
- Basic grade descriptives (minimum, quartiles and median, maximum) for each grade series in a table.
- Basic performance descriptives (participation rate, pass rate) for each grade series in a table.
- Histograms for the final grades and for the partial grades. It is likely good to have one histogram for each partial grade, plus a histogram that shows all partial grades (with color coding and transparency used so all distributions can be seen). This combined histogram likely becomes hard to read with more than 4 partial grades, so it’s possible to omit this if there are too many partial grades.
- A boxplot comparison for all partial grades and the final grade in one graph.
- Scatterplots for all possible combinations of partial grades, so it becomes possible to see if a good performance in assignment #1 is associated with a good performance in assignment #2.
COMPARISONS
The analyses are then repeated for each group attribute (including group number!) and for the optional student subdivision. For the example choices made here for TST5106, we should get the same battery of analyses 6 times:
- whole course
- comparison between 9 tutorial groups
- comparison between 4 tutors (Anne, Bob, Chuck and Daisy)
- comparison between 4 timeslots (08:30h, 11:00h, 13:30h, 16:00h)
- comparison between 5 specializations (Econ, Business, Accounting, Finance, Tax Accounting)
- comparison between scholarship status (Self-paying, scholarship)
For the comparisons, the analyses should be simplified somewhat as we cannot display two levels of grade comparison. A histogram of two partial grades times nine tutorial groups gets very messy. What seems feasible:
- summary tables for each final and partial grade, placing the numbers for the different comparison groups next to each other
- comparison histograms for each final and partial grade for small comparison sets (same standard of up to 4 histograms)
- comparison boxplots for each final and partial grade for all comparison sets
Please review these instructions, raise any issues and feel free to suggest additional features. Please prepare a spec sheet for my approval. Once approved, please build this tool.
As you can see, the initial prompt is quite extensive, coming to almost 2000 words. I use example tables based on the fake example data to guide Claude towards a specific approach, and I provide an itemized list of the analyses I want to obtain at the end to ensure that Claude can check the logic of my description against a concrete example.
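To make the import rules in the prompt concrete, here is a minimal sketch of how the filename and the section column could be read. It is written in Python for readability (the tool itself is JavaScript) and is not taken from the generated code. The regular expressions are assumptions based only on the two example filenames and the “Tutorial Section NN” format mentioned in the prompt; the function names are hypothetical.

```python
import re
from datetime import datetime

# Assumed pattern, derived from names like 2026-06-01T1445_Grades-TST5106.csv
FILENAME_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{4})_Grades-(.+)\.csv$")
# Assumed pattern for the group marker in column 5
GROUP_RE = re.compile(r"Tutorial Section (\d+)")

def parse_export_filename(filename):
    """Return (export date/time, course code) from a Canvas export filename."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename format: {filename!r}")
    exported_at = datetime.strptime(match.group(1), "%Y-%m-%dT%H%M")
    return exported_at, match.group(2)

def tutorial_group(section_cell):
    """Return the tutorial group number found in a section cell, or None."""
    match = GROUP_RE.search(section_cell)
    return int(match.group(1)) if match else None
```

A section cell that does not match the pattern returns `None`, which leaves room to treat it as one of the boolean attributes the prompt mentions.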
You can see the questions that Claude asked here, and the spec sheet it has written here.
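The three resit options from the prompt reduce to a small decision rule per student and assignment. The following Python sketch illustrates that rule only; the policy names `first_sit`, `first_attempt` and `best` are hypothetical labels, and `None` stands for a missing grade.

```python
def resolve_grade(sit, resit, policy):
    """Reduce a sit/resit pair to one grade, following the chosen policy.

    sit, resit: a grade as float, or None if no grade was recorded.
    policy: 'first_sit', 'first_attempt' or 'best' (hypothetical labels
            for options 1, 2 and 3 in the prompt).
    """
    if policy == "first_sit":
        return sit                                   # option 1: ignore resits
    if policy == "first_attempt":
        return sit if sit is not None else resit     # option 2: backfill gaps
    if policy == "best":
        present = [g for g in (sit, resit) if g is not None]
        return max(present) if present else None     # option 3: highest grade
    raise ValueError(f"unknown policy: {policy!r}")
```

For a student with a 4.5 on the first sit and a 6.0 on the resit, options 1 and 2 both yield 4.5, and only option 3 yields 6.0.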
Follow-up questions
After ingesting the prompt and looking through the example data, Claude first asks concrete follow-up questions. In this case, Claude had two batches of follow-up questions, starting with design issues and moving to technical choices.
Within Claude Code, each question is paired with a couple of reasonable answer choices, each of which can be amended. In addition, there is the option to type an answer not foreseen by Claude, or to have a discussion about this on the side in order to determine the best answer.
Below, I provide the question and my answer, edited lightly for readability.
Design questions
Claude: How should the tool treat the different ‘no real grade’ markers — a truly empty cell, the literal ‘N/A’, and a 0 — when computing participation rate and pass rate? (Dutch grades run 1–10, so 0 isn’t a real mark.)
Answer: There is no hard distinction, because not getting the grade implies a fail anyway. In practice: empty, N/A and 0 should be treated as absent.
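The rule in this answer can be sketched as follows, again in Python rather than the tool’s JavaScript. The function names are hypothetical, and the sketch assumes grades are written with a decimal point; a file using decimal commas would need an extra conversion step.

```python
def parse_grade(cell):
    """Return the grade as a float, or None if the cell counts as absent.

    Absent means: empty cell, the literal 'N/A', or a 0 (not a real mark
    on the Dutch 1-10 scale).
    """
    text = (cell or "").strip()
    if text == "" or text.upper() == "N/A":
        return None
    value = float(text)          # assumption: decimal point, not decimal comma
    return None if value == 0 else value

def participation_rate(cells):
    """Share of students with a real grade in this grade series."""
    grades = [parse_grade(cell) for cell in cells]
    return sum(g is not None for g in grades) / len(grades)
```

With the cells `"7.5"`, `""`, `"N/A"`, `"0"` and `"6"`, two of five students have a real grade, so the participation rate is 0.4.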
Claude: When a file has no explicit final-grade column (like TST5011), should the tool compute a weighted final grade from the partials and their shares?
Answer: Yes, compute & offer it. When no final column is assigned, derive final = weighted sum of partials (using the shares). Also offer it as an optional ‘check’ column even when a manual final exists.
Claude: How much statistical inference do you want in the group comparisons, beyond descriptives and charts?
Answer: Non-parametric tests + effect size. Mann-Whitney U (2 groups) / Kruskal-Wallis (3+ groups) with p-values, plus a simple effect-size measure. Matches your ordinal/non-parametric preference.
Claude: Decimal grades (e.g. 9.16, 7.85) need binning for histograms. What bin width should drive the histograms by default?
Answer: Always 0.5-wide bins unless the grades are whole numbers only, in which case use whole-grade bins.
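The binning rule from the last answer is easy to state in code. This Python sketch is an illustration, not the tool’s implementation; it assumes left-closed bins, so a grade of 7.5 falls into the bin starting at 7.5, and a 10 ends up in a bin of its own.

```python
import math
from collections import Counter

def bin_width(grades):
    """Whole-grade bins if every grade is a whole number, else 0.5-wide bins."""
    return 1.0 if all(float(g).is_integer() for g in grades) else 0.5

def histogram(grades):
    """Return (bin width, {bin start: count}) for a list of grades.

    Assumption: bins are left-closed, i.e. [start, start + width).
    """
    width = bin_width(grades)
    counts = Counter(math.floor(g / width) * width for g in grades)
    return width, dict(sorted(counts.items()))
```

For the decimal grades 9.16, 7.85, 7.5 and 8.0, this gives 0.5-wide bins with two grades in the bin starting at 7.5 and one each in the bins starting at 8.0 and 9.0.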
Technical queries
Claude: Confirm the charting approach for the single-file tool?
Answer: The most important determinant for me is that the user be able to save the resulting analysis, e.g. by printing it to PDF. If that is given, I prefer to embed a well-established JS chart library that follows industry best-practices.
Claude: Re-entering student subdivisions (and tutorial attributes) for 100+ students is tedious. Should the tool let you save/load your configuration so you don’t redo it?
Answer: Auto-save to browser only. Persist in-browser (localStorage) so a reload restores work, but no portable file to share. Note: localStorage may be unavailable on locked-down air-gapped machines.
Claude’s spec sheet
After the initial clarification, Claude drafts a spec sheet that covers all key choices.
The spec is longer than the initial prompt, and is reproduced here as a formatted PDF export:
Upon completion of the spec, Claude highlights the items that I need to prioritize in my review:
Please review the spec and let me know if you want any changes before I write the implementation plan. A few highlights worth your eye in particular:
- §3 Chart library — Plotly.js inlined (~3.5 MB file, SVG output for crisp PDF). Vega-Lite noted as a swappable alternative.
- §5.5 Grade semantics — empty / N/A / 0 all = absent; pass rate is computed over all students (absentees count as not-passed), with the participant-only rate shown alongside. This follows your “not getting the grade implies a fail anyway.”
- §6.2 Computed final — when no final column exists (TST5011), final = weighted sum of partials, with a missing partial counted as 0; flagged for affected students.
- §8.3 Comparisons — Mann-Whitney U / Kruskal-Wallis + effect size; groups with n < 3 shown descriptively but excluded from the test.
- §9 Persistence — localStorage only, degrading silently to stateless if blocked.
- §12 Out of scope (v1) — one file at a time, no cell editing, no downloadable config file, no parametric stats. Flag anything here you actually want pulled into v1.
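Two of these highlights, §5.5 and §6.2, describe rules that are worth checking against a small example before approving them. The Python sketch below illustrates those two rules as summarised in the highlights; it is not the spec’s or the tool’s code, the function names are hypothetical, and rounding of the computed final is left out because the highlights do not specify it.

```python
def pass_rates(grades, pass_mark=5.5):
    """Return (pass rate over all students, pass rate over participants).

    grades: list of floats, with None for absent students. Absentees count
    as not passed in the first rate and are left out of the second.
    """
    present = [g for g in grades if g is not None]
    passed = sum(g >= pass_mark for g in present)
    over_all = passed / len(grades) if grades else None
    over_participants = passed / len(present) if present else None
    return over_all, over_participants

def computed_final(partials, shares):
    """Weighted sum of partial grades, counting a missing partial as 0.

    partials: {assignment name: grade or None}
    shares:   {assignment name: share in percent}
    Returns (final grade, list of missing partials) so that affected
    students can be flagged.
    """
    total = sum((partials.get(name) or 0.0) * share / 100
                for name, share in shares.items())
    missing = [name for name in shares if partials.get(name) is None]
    return total, missing
```

A student with a 7.0 on the proposal and no paper grade gets a computed final of 3.5 under this rule, which shows why flagging the affected students matters.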
I strongly recommend reading the entire spec in detail and flagging anything that might be problematic. This process is similar to asking a supervisee to send you a summary of the supervision meeting. If your supervisee’s summary matches your memory, you’re on the same page. If not, misunderstandings can be caught before work is invested. And on occasion, their thoughts inspire thoughts of your own, pushing the work forward. I think it’s important not to anthropomorphise an LLM; doing so misunderstands its nature and introduces risks bigger than a project set-back. With our students, we want to err on the side of room for creativity and growth. With a coding agent with access to our computing resources, we want to err on the side of careful, pedantic checking.
In this case, I had omitted a feature I wanted from the original prompt by accident: the ability to sort the student list by name and ID. I was pleasantly surprised to find that this feature was already added to the spec.
Claude’s implementation plan
As part of the follow-up questions listed above, Claude proposed that it would complete this project with sub-agents. That means that the model I’m working with, Opus 4.8, breaks the overall task into smaller deliverables and writes an implementation plan that explains each individual deliverable and how they fit together. The plan is even more extensive than the spec:
With this plan, individual tasks can be handed to a smaller model, here Claude Sonnet 4.6.
Claude’s tests
While the small model does the coding, the large model writes tests that the code has to pass. For this project, examples of tasks include being able to open a Canvas grade export file, calculating statistics, making graphs and so forth. Each of these functions must pass quality control – so the statistics code that is supposed to calculate the Mann-Whitney U test needs to produce small p-values when groups have very different grades.
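To give a feeling for what such a check looks like, here is a self-contained Python sketch of a Mann-Whitney U test together with the kind of property a test would assert. It is not the tool’s JavaScript code. It uses the normal approximation with tie correction and no continuity correction, and it reports the rank-biserial correlation as effect size; both choices are assumptions, since the text only says “a simple effect-size measure”.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1          # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for x, p-value, rank-biserial correlation).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    tie_sizes = {}
    for value in combined:
        tie_sizes[value] = tie_sizes.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_sizes.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:                         # all values identical
        return u1, 1.0, 0.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    effect = 2 * u1 / (n1 * n2) - 1           # +1: every x is larger than every y
    return u1, p_value, effect
```

A test in the spirit of “separated groups give small p” would then assert that two groups with non-overlapping grades produce a p-value well below 0.05, and that two groups with near-identical grades do not.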
The tests themselves are written in JavaScript. I have collected the complete set of tests here for you to review.
While I code, I don’t use JavaScript. I can read the code and get the gist, but I’m not aware of this language’s peculiarities. But even so: each test comes with a clear, human-readable explanation of what it checks for. The statistical tests include checks like “mode returns the most frequent value”, “Kruskal-Wallis: separated groups give small p” and “Kruskal-Wallis: similar groups give large p”. If I am missing tests that I think should be happening, if I don’t understand the code or think there is an error, I can always intervene by either asking questions or updating the documents and alerting the LLM to my changes.
Testing and iterating
After some work, Claude produces the final output. Because we are dealing with sensitive data, I test for data leakage first.
Because our HTML file was designed to (a) download nothing from the internet, and (b) upload nothing to the internet, the first test is to open a browser, switch on developer tools, switch to the network tab and then open the brand-new HTML file. The network graph shows nothing except the file itself. I keep the network graph open for the rest of the test to ensure no up- or downloads happen at a later stage.
![/assets/img/network-traffic.png] Figure 1: the browser reports no incoming or outgoing network traffic
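The network tab shows what actually happens at run time. A complementary static check, which is not part of the procedure described here, is to scan the HTML file for attributes that point to a remote address. The Python sketch below shows one way this could be done with the standard library; the class and function names are hypothetical. It only looks at `src` and `href` attributes, so it will not notice URLs assembled inside JavaScript or referenced from CSS, and it will also flag ordinary hyperlinks that load nothing until clicked.

```python
from html.parser import HTMLParser

class ExternalReferenceFinder(HTMLParser):
    """Collect src/href attribute values that point to a remote origin."""

    REMOTE_PREFIXES = ("http://", "https://", "//")

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in ("src", "href") or not value:
                continue
            if value.strip().lower().startswith(self.REMOTE_PREFIXES):
                self.found.append((tag, name, value))

def find_external_references(html_text):
    """Return a list of (tag, attribute, url) for remote src/href values."""
    finder = ExternalReferenceFinder()
    finder.feed(html_text)
    finder.close()
    return finder.found
```

An empty result is a useful sanity check but no proof of safety, which is why watching the live network traffic remains the primary test.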
As part of the test, I use a fake grade file that I did not share with the LLM before. It is still fully anonymised for safety (in case there is any network traffic after all). It is also completely different data: in case the example data I provided prior to coding has some quirks, a fresh data file may help us discover this problem.
I also test the output by calculating the same statistics and graphs myself using Excel and STATA to ensure that the analysis is correct.
For this project, the tests showed that all data handling was local and the output was correct, but a number of tweaks to the user interface were made.
GPT’s review
Finally, I ask a different LLM to perform a security review with specific attention to the data leakage concern. This is handy because different providers train their models differently and on different training corpora. So, if Claude has a blind spot, a different model stands a better chance to spot it. Here, I use OpenAI’s GPT 5.5 Pro:
The take-away is that there are “no active web downloads and no app-initiated web uploads” with the (correct!) caveat that the graph-drawing library that Claude picked does have features for other graphs that could use web uploads. Because our tool does not use these features, this code never gets activated.
What would this cost?
This was a fairly complex project that started from scratch. The markup and programming languages used here (HTML, CSS, JavaScript) and the work that needs to be done (data handling, basic statistics and graphing) are all well within the trained capabilities of most medium-sized models.
I chose to use a commercial frontier model that was introduced less than 4 days prior, with its effort set to high. I also required that all tool use be approved by me, adding substantial amounts of interaction and interrupting the workflow of the model. Each of these choices is costly.
In total, this project was completed in about six hours of work time, of which Claude booked 1 hour, 19 minutes and 31 seconds of compute time. If I had paid for this through API use, this project would have cost $32.34 (without VAT), or approx. €33.60 at the current exchange rate and with Dutch VAT. As I have a $20 Claude Pro subscription for this month, I was not charged anything and used approx. 12% of my weekly allowance.
The security check was done with GPT 5.5 Pro, also a recent commercial frontier model. The check itself, even though it lasted less than 5 minutes, was pricey because it includes ingesting the entire tool. As a result, the cost of the security check comes to $34.11 without VAT, approx. €35.44 incl. Dutch VAT.
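The conversion from dollar prices without VAT to euro prices with VAT can be retraced with a short calculation. In the Python sketch below, the 21% rate is the standard Dutch VAT rate. The exchange rate is not stated in the text: the value of about 1.1646 USD per EUR is an assumption, back-calculated so that both quoted euro amounts are reproduced.

```python
DUTCH_VAT = 0.21        # standard Dutch VAT rate
USD_PER_EUR = 1.1646    # assumption: back-calculated, not stated in the text

def usd_excl_vat_to_eur_incl_vat(usd):
    """Add Dutch VAT to a USD amount and convert the result to EUR."""
    return round(usd * (1 + DUTCH_VAT) / USD_PER_EUR, 2)

project_eur = usd_excl_vat_to_eur_incl_vat(32.34)    # build with Claude
review_eur = usd_excl_vat_to_eur_incl_vat(34.11)     # security review
```

With a different exchange rate the euro amounts shift accordingly; the dollar figures are the ones reported by the providers.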
Both the initial project and the security check could have been done with cheaper models by the same or different providers, likely with a similarly good result.