Automated Evaluation Methodology
The performance of each model and prompt is evaluated against several criteria:
- Parsing: Does the generated code parse correctly in Julia?
- Execution: Can the code execute without errors?
- Unit Tests: Do the included unit tests pass?
- Example Runs: Does the code run in a provided example scenario?
At the moment, all criteria are weighted equally, and each test case can earn a maximum of 100 points.
If the generated code passes all criteria, it earns 100/100 points.
If it fails one criterion (e.g., all unit tests fail), it earns 75/100 points.
If it fails two criteria (e.g., it runs, but all examples and unit tests are broken), it earns 50/100 points, and so on.
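For illustration, here is a minimal sketch of the equal-weight scoring described above; the boolean flags are hypothetical placeholders, not the repository's actual evaluation API:

```julia
# Sketch only: the flag names below are hypothetical placeholders,
# not the repository's actual evaluation API.
function score_sketch(; parsed::Bool, executed::Bool,
                      unit_tests_passed::Bool, examples_passed::Bool)
    criteria = (parsed, executed, unit_tests_passed, examples_passed)
    # Four equally weighted criteria, 25 points each
    return 100 * count(criteria) / length(criteria)
end

score_sketch(parsed = true, executed = true,
             unit_tests_passed = false, examples_passed = true)  # 75.0
```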
Definition.toml
Each test case is defined in a `definition.toml` file with the structure described in Anatomy of `definition.toml`.
We chose the TOML format because it is human-readable and easy to edit in a text editor or on GitHub.
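As a quick sketch, a definition can be parsed with Julia's standard-library TOML module; the path below is a placeholder following the layout described in the next section:

```julia
using TOML

# Placeholder path following the repository layout described below;
# substitute a real category and test case name.
path = joinpath("code_generation", "category", "test_case_name", "definition.toml")
definition = TOML.parsefile(path)  # Dict{String, Any} with the fields from "Anatomy of definition.toml"
```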
Repo Structure / Naming Convention
To enhance transparency and reproducibility, we save all conversations and evaluations in a nested folder structure.
Folder Convention:
- Definitions are saved in nested folders following the format `code_generation/category/test_case_name/definition.toml`
- Evaluation results are saved in nested sub-folders, keyed by the model (see the sketch after this list):
  - Evaluation result: `code_generation/category/test_case_name/model/evaluation__PROMPT__STRATEGY__TIMESTAMP.json`
  - Conversation: `code_generation/category/test_case_name/model/conversation__PROMPT__STRATEGY__TIMESTAMP.json`
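As a sketch of how this convention can be navigated programmatically (all path components below are placeholders, not actual categories, test cases, or models):

```julia
# Sketch: list the saved artifacts for one (hypothetical) test case and model,
# relying only on the naming convention above.
model_dir = joinpath("code_generation", "category", "test_case_name", "model")

files = readdir(model_dir)  # file names only, without the directory prefix
evaluation_files   = filter(f -> startswith(f, "evaluation__")   && endswith(f, ".json"), files)
conversation_files = filter(f -> startswith(f, "conversation__") && endswith(f, ".json"), files)
```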
You can load any conversation with `PromptingTools.load_conversation()` and display it with `edit` or `preview`, depending on your IDE/preference.
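For example (a sketch with a placeholder path; `preview` is assumed to be available in the benchmark environment as referenced above):

```julia
using PromptingTools

# Placeholder file name following the naming convention above
conv_path = joinpath("code_generation", "category", "test_case_name", "model",
                     "conversation__PROMPT__STRATEGY__TIMESTAMP.json")
conversation = PromptingTools.load_conversation(conv_path)

# `preview` (or `edit`) as mentioned above; how it renders depends on your IDE setup
preview(conversation)
```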
You can load any evaluation with `JSON3.read` and score it with `score_eval`.
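Similarly for evaluations (again a sketch with a placeholder path; the exact signature of `score_eval` is assumed):

```julia
using JSON3

# Placeholder file name following the naming convention above
eval_path = joinpath("code_generation", "category", "test_case_name", "model",
                     "evaluation__PROMPT__STRATEGY__TIMESTAMP.json")
evaluation = JSON3.read(read(eval_path, String))

# `score_eval` comes from this repository's evaluation utilities (signature assumed);
# it returns the overall score out of 100, per the criteria above
score_eval(evaluation)
```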