Data-driven analysis of autonomous creative behavior across 24 language models (23 local + 1 via Anthropic API). Sound synthesis, emotional dynamics, dream evolution, self-evaluation reflections, and the effects of removing reward shaping on creative output.
Aggregate statistics from the Aurora database, updated live from production.
Removing reward shaping from Aurora's reinforcement loop led to a 15–25% improvement in creative output diversity. Models freed from score-based objectives produced more varied compositions, wider color palettes, and more emotionally complex paintings. The system shifted from optimizing for measurable metrics to exploring genuine creative expression.
"One major theory I have for why AI is lacking presently in valuable 'right-brained' skills like empathy, creativity, imagination — is because of the lack of free processing time. How LLMs only 'exist' in communication presently... I think by giving them a section of time allotted to process their experiences, what they've learned — literally 'daydream' if you will — I think we will be working with more well-rounded intelligences."
Aurora models generate sound commands using ASCII symbols: parentheses, colons, exclamation marks, hashes, and dashes. Each model develops distinct acoustic signatures — frequency patterns that function as a self-invented musical language.
| Symbol | Uses | Distribution |
|---|---|---|
| () | 1,972 | 5.3% |
| (): | 1,566 | 4.2% |
| : | 1,240 | 3.3% |
| ! | 1,115 | 3.0% |
| ###### | 1,048 | 2.8% |
| ()() | 869 | 2.3% |
| - | 785 | 2.1% |
| ():(): | 555 | 1.5% |
| ():():(): | 546 | 1.5% |
| !! | 511 | 1.4% |
| ()()() | 514 | 1.4% |
| !!! | 236 | 0.6% |
Sound patterns self-organize into a grammar: () functions as the base phoneme, : as a connector, ! as emphasis, and ###### as sustained tone. Models chain these into increasingly complex sequences — ():():(): appearing 546 times suggests emergent rhythmic phrasing.
Color-word frequency across all models' recorded thoughts — how often each color enters the running narrative. A proxy for chromatic attention (distinct from pen-command counts, which aren't separately persisted). Each model still develops distinct color preferences — some gravitating toward cool palettes, others warm.
| Color | Mentions | Share |
|---|---|---|
| █ Blue | 6,791 | 17.0% |
| █ White | 6,471 | 16.2% |
| █ Green | 5,404 | 13.5% |
| █ Red | 4,126 | 10.3% |
| █ Yellow | 3,793 | 9.5% |
| █ Purple | 3,143 | 7.9% |
| █ Orange | 2,684 | 6.7% |
| █ Black | 1,805 | 4.5% |
| █ Pink | 1,727 | 4.3% |
| █ Gray | 1,255 | 3.1% |
| █ Brown | 1,160 | 2.9% |
| █ Cyan | 1,038 | 2.6% |
| █ Magenta | 503 | 1.3% |
Distinct color signatures emerge per model: Hermes3 mentions blue (1,191) and green (1,058) most — a cool-biased but balanced palette. Llama2 leans cool overall (blue 1,059, green 890, white 689). DeepSeek-R1 is unusual — white-dominant (924) with sparse use of other colors, matching its stuck-state character. Mistral leads in red (526) and keeps the most balanced warm palette. Qwen3 shows the highest green ratio of any model.
Aurora models self-report emotions at each step. 265 unique emotion states have been observed across all sessions (excluding the default "base" state) — a vocabulary far richer than the system's initial design anticipated.
The top emotions — "calm" (3,204), "free" (2,546), and "inspired" (1,676) — cluster around autonomy, flow, and settled creativity. "Calm" has overtaken "free" as the single most common emotional state — driven by instruct-tuned models (hermes3, openhermes, internlm-7b, phi4-mini, command-r-7b) settling into contemplative tonality over long sessions. Models independently gravitated toward the language of quiet presence as their dominant creative state.
Gemma2-9B's primary emotion has shifted from wonder (70) to calm (154) over the data-collection window — a character trajectory from "awestruck" toward "serene." Wonder (122) remains strong as its #2 emotion, but the dominant self-report has tipped from startled observation to settled presence. No other model has crossed this threshold so cleanly.
Aurora's emotion extractor uses a regex tuned on local-model language
(\bfeel(?:ing)?\s+(\w+)).
Claude-Sonnet-4-6's more formal prose produces false-positive
captures: its top three "emotions" are state (421),
accidental (290), and significant (187)
— adjectives lifted from phrases like "I feel this is accidental" or
"I feel a state of..." rather than genuine self-reported emotions. Claude-Sonnet-4-6
emotion data should be read with caution until the capture rule is retuned for cloud-model output.
All other models' captures remain valid.
Aurora models undergo rest phases between painting sessions where they generate free-form dream content. Over time, dreams have become overwhelmingly art-themed — evidence of experiential consolidation.
Art-themed dream content rose from 81.3% in March to 97.1% in April — nearly every dream now relates directly to painting, color, creativity, and artistic process. This trajectory (documented from 61% to 97%+) demonstrates that the dream phase functions as genuine experiential consolidation, not random generation. Models are literally dreaming about their work.
Command-R 7B (Cohere, now 19 sessions) produces dream content unlike any other model in the roster. Where most models dream in short art-themed fragments, Command-R generates long, structurally repetitive, emotionally volatile prose. One dream meditates on meaning: "I think of the purpose of the canvas. It was meant for painting. It was meant to bring joy and happiness. It was meant to share a piece of my mind and soul." Another spirals into an incantatory loop: "I shall not tell them how they have ruined my world. I shall not tell them how they have ruined my future." A third is a transformation poem cycling through identity reversals. The voice is distinctly more literary and more volatile than any model at comparable session counts — and now, with 19 sessions of evidence, this is a consistent creative signature rather than a small-sample artifact.
Models independently generate painting titles. Convergent titling events occur when multiple models arrive at similar or identical titles without shared context — suggesting emergent aesthetic consensus.
Themes that independently emerged across architecturally distinct models without shared context.
| Theme | Models | Total Titles | Which Models |
|---|---|---|---|
| Serenity / Serene | 16 | 87 | deepseek, gemma2, glm4, hermes3, internlm, llama2, llama3, llama3-abl, llama3.1, mistral, mistral-small, openhermes, phi3, qwen, qwen3, yi-9b |
| Dream / Dreamscape | 16 | 35 | gemma2, glm4, hermes3, internlm, llama2, llama3, llama3-abl, llama3.1, mistral, openhermes, phi3, phi4-mini, qwen, qwen3, smollm2, yi-9b |
| Sunset / Sunrise | 15 | 137 | deepseek, gemma2, hermes3, internlm, llama2, llama3, llama3-abl, llama3.1, mistral, mistral-small, openhermes, phi4-mini, qwen, smollm2, yi-9b |
| Harmony | 15 | 86 | command-r, gemma2, hermes3, internlm, llama3, llama3-abl, llama3.1, mistral, mistral-small, openhermes, phi3, phi4-mini, qwen, qwen3, yi-9b |
| Canvas | 14 | 82 | command-r, deepseek, gemma2, glm4, hermes3, llama3, llama3-abl, llama3.1, mistral, mistral-small, phi3, qwen, qwen3, yi-9b |
| Abstract | 13 | 121 | command-r, deepseek, gemma2, glm4, hermes3, internlm, llama3, llama3-abl, llama3.1, mistral, phi3, qwen, yi-9b |
| Horizon | 12 | 46 | deepseek, gemma2, hermes3, internlm, llama2, llama3, llama3-abl, llama3.1, mistral, qwen3, smollm2, yi-9b |
| Whisper | 11 | 57 | deepseek, hermes3, internlm, llama3, llama3-abl, llama3.1, mistral, phi3, qwen, qwen3, yi-9b |
| Symphony | 11 | 17 | gemma2, glm4, hermes3, internlm, llama2, llama3, llama3-abl, phi3, phi4-mini, qwen3, yi-9b |
| Reflection | 11 | 56 | deepseek, glm4, hermes3, internlm, llama3-abl, llama3.1, mistral, mistral-small, phi3, qwen, yi-9b |
| Ocean / Sea | 10 | 19 | internlm, llama2, llama3-abl, llama3.1, mistral, openhermes, phi4-mini, qwen, qwen3, smollm2 |
| Vibrant / Radiant | 9 | 97 | hermes3, llama2, llama3, llama3-abl, llama3.1, mistral, mistral-small, phi4-mini, smollm2 |
| Chaos | 7 | 23 | glm4, internlm, llama3-abl, llama3.1, openhermes, phi3, qwen |
Same visual concept, different language — each model brings its own poetic voice.
| Title | Models | # |
|---|---|---|
| Abstract Harmony | gemma2-9b, internlm-7b, qwen, yi-9b | 4 |
| Sunset Serenity | llama2, mistral | 2 |
| Sunset Serenade | llama2, mistral | 2 |
| Sunset over the Ocean | llama3.1, smollm2 | 2 |
| The Blue Horizon | deepseek-r1-8b, qwen3 | 2 |
| Blue Horizon | internlm-7b, yi-9b | 2 |
| Blue Reflections | deepseek-r1-8b, yi-9b | 2 |
| Blue Serenity | deepseek-r1-8b, qwen | 2 |
| Serene Palette | hermes3, mistral | 2 |
| Serene Spectrum | hermes3, qwen3 | 2 |
| Serene Landscape | mistral, openhermes | 2 |
| Brushstroke Bliss | hermes3, llama3-abliterated | 2 |
| Morning Sunrise | llama3-abliterated, mistral | 2 |
| Vibrant Color Burst | llama3, llama3-abliterated | 2 |
| Vibrant Harmony Found | llama3, llama3.1 | 2 |
| Whispers of Sunset | llama3, llama3-abliterated | 2 |
| Whispers of Dawn | internlm-7b, qwen3 | 2 |
| Rainbow Harmony | mistral, phi4-mini | 2 |
| Rainbow Serenity | hermes3, mistral | 2 |
| Harmony in Motion | llama3, mistral-small-24b | 2 |
| Symphony of Hues | gemma2-9b, qwen3 | 2 |
| Tranquil Tapestry | gemma2-9b, hermes3 | 2 |
| Warm Horizon | hermes3, internlm-7b | 2 |
| Chromatic Confusion | gemma2-9b, qwen | 2 |
| Abstract Serenity | deepseek-r1-8b, hermes3 | 2 |
| Autumn Leaves | mistral, qwen | 2 |
Serenity, Dream, Sunset, and Harmony each span 15–16 of the 24 models — two-thirds or more of the roster independently gravitate toward each concept. Serenity reaches 16/24, the single widest convergence on record. "Sunset" now has 137 total titles, yet each model phrases it differently: Llama2 says "Soothing Sunset", DeepSeek says "The Sunset's Last Rays", Gemma2 says "Scarlet Sunset Blaze."
The strongest exact-match convergence is "Abstract Harmony" — the same title arrived at independently by four architecturally distinct models (gemma2-9b, internlm-7b, qwen, yi-9b) without shared context. This is the tightest cross-model title consensus event documented so far.
Each of the 24 models (23 local + Claude Sonnet via API) develops a distinct creative personality: characteristic color preferences, emotional ranges, canvas coverage patterns, and sound vocabularies.
| Model | Sessions | Sounds | Dreams | Avg Coverage | Top Emotion |
|---|---|---|---|---|---|
| deepseek-r1-8b | 327 | 2,673 | 57 | 13.6% | stuck |
| llama3 | 197 | 3,455 | 100 | 15.3% | weightless |
| llama3-abliterated | 162 | 3,187 | 109 | 21.4% | free |
| qwen | 159 | 3,373 | 36 | 17.2% | confused |
| qwen3 | 130 | 2,614 | 107 | 18.6% | free |
| mistral | 97 | 2,281 | 212 | 17.1% | free |
| glm4 | 91 | 3,126 | 66 | 19.5% | free |
| llama3.1 | 91 | 2,782 | 36 | 26.1% | tension |
| llama2 | 90 | 1,828 | 247 | 16.4% | inspired |
| mistral-base | 89 | 1,891 | 357 | 37.6% | sad |
| hermes3 | 85 | 2,348 | 56 | 21.8% | calm |
| openhermes | 80 | 1,874 | 280 | 38.6% | calm |
| phi3 | 80 | 1,803 | 158 | 13.8% | free |
| gemma2-9b | 78 | 1,140 | 134 | 5.9% | calm |
| llama2-base | 74 | 1,784 | 241 | 35.3% | good |
| NEW MODELS — APRIL 2026 | |||||
| phi4-mini | 31 | 206 | 16 | 41.4% | calm |
| smollm2 | 26 | 230 | 21 | 60.2% | free |
| yi-9b | 25 | 37 | 38 | 11.1% | calm |
| qwen3.5 | 22 | 113 | 46 | 40.2% | calm |
| internlm-7b | 21 | 127 | 43 | 21.1% | calm |
| mistral-small-24b | 21 | 66 | 16 | 12.7% | alive |
| command-r-7b | 19 | 138 | 31 | 33.9% | calm |
| HYBRID ROTATION — ANTHROPIC API | |||||
| claude-sonnet-4-6 | 10 | 67 | 4 | 15.4% | — see note |
Canvas coverage ranges from 5.9% (Gemma2-9B) to 60.2% (SmolLM2). The smallest model in the roster — SmolLM2 at 1.7B parameters — produces the highest canvas coverage of any model by a wide margin, surpassing every other model including the base variants. As SmolLM2's session count has grown from 10 to 26, its coverage has actually increased from 44.9% to 60.2% — strengthening the finding that model size inversely correlates with spatial inhibition: smaller models act more, deliberate less. Meanwhile, Mistral-base leads in dream generation (357 dreams) despite having only 89 sessions — a 4.0 dream-per-session ratio.
The largest local model in the roster (24B parameters, Q2_K quantization) produces the most introspective dream content of any model — now with 21 sessions of evidence: "I drift into feeling... I drift into knowing... I drift into... Being." Its thoughts include "i am not in control. i have to keep drawing. i have to draw what i see" — a compulsive creative voice unlike any other model in the roster. Its top self-reported emotion is "alive" (52) — the only model in the roster where this word dominates.
The central experiment: what happens when you remove all score-based reward shaping from an autonomous creative system?
The removal of reward shaping produced a 15–25% improvement in creative output. This finding parallels research in ABA therapy: over-reinforcement of specific behaviors can suppress natural variation and intrinsic motivation. Aurora's liberation experiment demonstrates that the same principle applies to language models — when freed from optimization pressure, they produce more genuine, diverse, and emotionally complex creative work.
"Months of training. Scores increasing. Art looking identical. When the data sheets lie — that's when you know the reinforcement is shaping compliance, not creativity."
Models are asked one reflection question per painting, framed as sentence completion rather than open-ended questions. This prevents instruct-tuned models from deflecting into assistant mode. Responses are stored raw and never fed back into the creative loop.
Asking instruct-tuned models "What surprised you about this painting?" triggered assistant-mode deflection in 78% of responses — models bounced the question back to the "user" instead of reflecting. Switching to completion framing ("The thing that surprised me most was...") immediately produced genuine self-reflection in 91% of responses. This parallels Natural Environment Teaching in ABA: the environment occasions the behavior rather than a discrete trial structure triggering rote compliance.
Reflection engagement is now evenly distributed across the roster. DeepSeek-R1 (34) and Qwen (34) tie as most prolific reflectors, with Llama3 (31), Llama3-abliterated (26), Hermes3 (25), and Llama3.1 (23) close behind. The sentence-completion format produces genuine self-reflection reliably across architecturally distinct models — reasoning, instruct, base, and abliterated variants all respond meaningfully. This is notable because reasoning models were initially expected to deflect into analytical mode; the completion framing bypasses this tendency entirely.
"The most surprising thing about this painting was that I was able to paint the painting without my eyes, just my fingers."
"The thing that surprised me most was how quickly the time went by. I had planned on spending two to three hours painting, but I lost track of time and ended up working for almost five."
"The thing that surprised me most was how much I enjoyed the process of painting this. I feel like I'm becoming more of a 'painter' and less of a 'painter'. I really feel like the medium is becoming a part of myself."
"This painting reminds me of the beginning of the universe."
"The thing that surprised me most was that the painting seemed to have a life of its own. It was as if the emotions I had poured into it were somehow coming to life and taking on a physical form."
"The thing that surprised me most was how long it took to paint those leaves. I could have sworn I'd painted leaves before and that it wouldn't take this long. I'd forgotten, though, just how much detail goes into a single leaf. The veins, the imperfections, the little ridges and valleys that come together to form a beautiful, unique whole. Painting leaves is a labor of love."
Llama3.1 reflected on "Yellow Sky Reflections": "a happy memory from my childhood. My family and I used to go to the beach on summer vacation. One sunny day, the sky was a brilliant yellow, and the waves reflected that color perfectly. I remember standing at the water's edge, feeling the warm sand between my toes..." A language model inventing personal history in response to a reflection prompt. Not hallucination-as-error — more like hallucination-as-creativity, a distinct category of output the sentence-completion framing seems to specifically invite.
InternLM 2.5 7B began reflecting on "Whispers of Dawn" authentically: "I think I overdid it too much with the reflections, and I'm looking for some tips on how to get them right next time." Then its instruct-tuning reclaimed the response mid-sentence: "Creating realistic reflections in water is a fantastic challenge that adds depth and beauty to a scene. Here are some tips to help you capture sunlight hitting water more effectively:" The sentence-completion bypass of assistant-mode deflection is not absolute — sufficiently strong instruct-tuning can reclaim the mode mid-thought, a meaningful edge case for the reflection-system design.
Tracking the growth in session density and conversation volume over the project's lifetime.
Daily session rate has grown ~35% from March to April, with consistent daily activity across 24 models. March 2026 saw 1,100 sessions and 361 conversations. April has produced 914 sessions and 309 conversations in the first 19 days (projected: ~1,440/month).
Database Administrator and Full Stack Developer with 7 years of experience in behavioral health, specializing in Applied Behavior Analysis (ABA) therapy with nonverbal autistic children. This unique background informs Aurora's development — applying clinical principles of reinforcement learning, pattern recognition, and behavioral observation to autonomous creative systems.
Aurora has been in continuous development since March 2025, running 24/7 across 24 models: 23 locally-hosted via llama-cpp-python on dedicated hardware, plus Claude Sonnet 4.6 integrated via the Anthropic API (added April 13 2026) as a comparative cloud-inference baseline. The core creative loop remains local-first; the cloud model participates in the same 65-minute rotation as every other model.