Score with Ocsai
Ocsai uses fine-tuned LLMs to score divergent thinking responses, achieving up to r = .81 correlation with human raters (Organisciak et al., 2023). Large files are processed in chunks with live progress.
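Chunked processing of a large file can be sketched as below. This is a minimal illustration, not Ocsai's actual implementation: the chunk size, the scoring callback, and the progress callback are all assumed names.

```python
def score_in_chunks(responses, score_fn, chunk_size=50, on_progress=None):
    """Score a large list of responses in fixed-size chunks.

    Illustrative sketch only: chunk_size, score_fn, and on_progress
    are assumptions, not Ocsai's real internals.
    """
    scores = []
    for start in range(0, len(responses), chunk_size):
        chunk = responses[start:start + chunk_size]
        scores.extend(score_fn(chunk))  # e.g. one model call per chunk
        if on_progress:
            done = min(start + chunk_size, len(responses))
            on_progress(done, len(responses))  # drives a live progress bar
    return scores
```

A stand-in scorer (here, one that returns 3.0 for everything) shows the flow: `score_in_chunks(items, lambda c: [3.0] * len(c), on_progress=print)`.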
Results
Originality ranges from 1 to 5, where 1 is minimally original and 5 is maximally original.
Cite
Organisciak, P., Acar, S., Dumas, D., & Berthiaume, K. (2023). Beyond semantic distance: Automated scoring of divergent thinking greatly improves with large language models. Thinking Skills and Creativity, 49, 101356. https://doi.org/10.1016/j.tsc.2023.101356
Large Language Model Scoring Details
Models
- ocsai-1.6: An update to the multilingual, multi-task 1.5 model, trained on GPT-4o instead of GPT-3.5.
- ocsai1-4o: GPT-4o-based model, trained with more data and supporting multiple tasks. Last update to the Ocsai 1 models (i.e. the original ones).
- ocsai-chatgpt2: GPT-3.5-size chat-based model, trained with more data and supporting multiple tasks. Scoring is slower, with slightly better performance than ocsai-davinci3.
- ocsai-davinci3: GPT-3 Davinci-size model, trained with the method from Organisciak et al. (2023) but with the additional tasks (uses, consequences, instances, complete the sentence) from Acar et al. (2023), and trained on more data.
- ocsai-1.5: Beta version of new multi-lingual, multi-task model, trained on GPT 3.5.
- ocsai-chatgpt: GPT-3.5-size chat-based model, trained with the same format and data as the original models. Scoring is slower, with slightly better performance than ocsai-davinci2. For more tasks and more training data, use ocsai-davinci3.
- ocsai-babbage2: GPT-3 Babbage-size model from the paper, retrained with the new model API. Deprecated, mainly because other models work better.
- ocsai-davinci2: GPT-3 Davinci-size model from the paper, retrained with a new model API.
How does it work?
Ocsai uses supervised learning: models are fine-tuned on thousands of human-judged divergent thinking responses so they learn what originality looks like. Scores use a 1–5 scale, where 1 is very unoriginal, 5 is very original, and 3 is the median.
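As an illustration of that supervised setup, one human-rated response could be turned into a chat-style fine-tuning record like the sketch below. The field names and prompt wording are hypothetical, for illustration only, and are not Ocsai's actual training format.

```python
import json

def to_finetune_record(task, prompt, response, human_score):
    """Convert one human-rated divergent-thinking response into a
    chat-format fine-tuning example (one JSONL line per record).
    Message wording and structure are hypothetical."""
    return {
        "messages": [
            {"role": "user",
             "content": f"Task: {task}\nPrompt: {prompt}\nResponse: {response}\n"
                        "Rate originality from 1 to 5."},
            {"role": "assistant", "content": str(human_score)},
        ]
    }

record = to_finetune_record("uses", "brick", "grind it into pigment", 4)
line = json.dumps(record)  # one line of a JSONL training file
```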
LLM-based scoring is a newer approach, and research into its strengths and edge cases is ongoing. Contact us with questions at peter.organisciak@du.edu.
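For programmatic use, a scoring request might be assembled as in this sketch. The endpoint URL and the field names (`model`, `items`, `prompt`, `response`) are assumptions made for illustration, not a documented API contract; check the live service documentation for the real interface.

```python
import json
from urllib import request

API_URL = "https://example.org/score"  # placeholder, not a real endpoint

def build_scoring_request(prompt, responses, model="ocsai-1.6"):
    """Assemble a JSON scoring request. All field names here are
    assumed for illustration only."""
    payload = {
        "model": model,
        "items": [{"prompt": prompt, "response": r} for r in responses],
    }
    return request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_scoring_request("brick", ["use as a doorstop", "grind into pigment"])
# req could then be sent with urllib.request.urlopen(req) -- not executed here
```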