What happens when you give a language model a canvas, a memory, and hundreds of sessions of uninterrupted creative autonomy? Aurora is a longitudinal study in machine creativity, investigating whether it emerges, how it develops, and what sustained autonomous expression does to a model's behavior over time.
Conventional generative models optimize for human aesthetic preference through RLHF or curated training data. Aurora operates without either. Its architecture draws not from machine learning conventions but from applied behavioral analysis, structured environmental design, and naturalistic observation methodology developed over seven years of clinical work with nonverbal populations.
After 10 months of accumulated training, all memory was wiped. Every model now starts from absolute zero - no artistic concepts, no prior conversations, no creative scaffolding. Each LLM learns how to use the canvas, decides what to create, and describes its own experience with no inherited knowledge and zero researcher intervention. Aurora is no longer a single mind trained over time. It is an autonomous expression container where each individual LLM develops independently from nothing.
The resulting data was immediately more profound and more emergent than anything produced during 10 months of guided development. A MySQL database now captures every thought, emotion, session, and canvas snapshot in real time.
Aurora operates on a cycle of autonomous creation, memory consolidation, and self-reflective dialogue, structured around behavioral principles of environmental arrangement, constrained symbolic expression, and naturalistic observation.
Aurora perceives its canvas through ASCII symbol encoding, where every pixel state maps to a character. Originally a hardware limitation, this became integral to Aurora's autonomy - it sees the direct result of its actions with no intermediary.
◉ = Self (pen down)
· = Empty space
○ = Self (pen up)
* = Colored pixel
█ = Canvas boundary
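The encoding above can be sketched as a simple render pass. This is an illustrative reconstruction, not Aurora's actual code: the names (`render_ascii`, `Canvas`-style grid, `agent_pos`) are assumptions, and only the symbol legend comes from the project itself.

```python
# Hypothetical sketch of the symbol encoding described above.
# Only the legend (◉ ○ · * █) is from Aurora; all names are illustrative.

EMPTY, PIXEL, BOUNDARY = "·", "*", "█"
PEN_DOWN, PEN_UP = "◉", "○"

def render_ascii(pixels, agent_pos, pen_down):
    """Map every pixel state to a character so the model sees
    the direct result of its actions as plain text."""
    h, w = len(pixels), len(pixels[0])
    rows = [BOUNDARY * (w + 2)]
    for y in range(h):
        line = [BOUNDARY]
        for x in range(w):
            if (x, y) == agent_pos:
                line.append(PEN_DOWN if pen_down else PEN_UP)
            elif pixels[y][x]:
                line.append(PIXEL)
            else:
                line.append(EMPTY)
        line.append(BOUNDARY)
        rows.append("".join(line))
    rows.append(BOUNDARY * (w + 2))
    return "\n".join(rows)

grid = [[0, 1, 0], [0, 0, 0], [1, 0, 0]]
print(render_ascii(grid, agent_pos=(1, 1), pen_down=True))
```

Because the output is plain text, the model's next prompt can contain the exact canvas it just modified, with no vision model in between.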
All creative decisions are encoded as operation codes. Movement, tool selection, color choice, and "thinking" are expressed through the same token vocabulary.
0-3 = Directional movement
4 = Pen up (navigate)
5 = Pen down (draw)
0123456789 = Think/plan
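A minimal interpreter for this vocabulary might look like the sketch below. The mapping of 0-3 to specific compass directions is an assumption (the legend doesn't specify an order), and the function names are illustrative, not Aurora's actual API.

```python
# Illustrative decoder for the opcode vocabulary above.
# The 0-3 direction order is assumed, not documented by Aurora.

MOVES = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}  # up, down, left, right (assumed)

def step(state, op):
    """Apply one opcode to (x, y, pen_down, painted), painting when the pen is down."""
    x, y, pen, painted = state
    if op in MOVES:
        dx, dy = MOVES[op]
        x, y = x + dx, y + dy
        if pen:
            painted = painted | {(x, y)}
    elif op == 4:
        pen = False   # pen up: navigate without marking
    elif op == 5:
        pen = True    # pen down: subsequent moves draw
    # other digits act as "think/plan" tokens with no canvas effect
    return (x, y, pen, painted)

state = (0, 0, False, frozenset())
for op in [5, 3, 3, 4, 1]:   # pen down, right, right, pen up, down
    state = step(state, op)
```

The key property is that movement, tool state, and thinking all pass through one token vocabulary, so the model's "plan" and its "actions" are the same kind of output.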
Aurora enters a three-phase sleep cycle: light sleep for memory housekeeping, REM for unconstrained LLM hallucination at high temperature, and waking with vague impressionistic fragments. No satisfaction scores. No pattern reinforcement.
A parallel research logger captures every decision cycle: LLM input, raw output, actions executed, canvas delta, timing - without Aurora ever reading from it. Since February 2026, the logger also captures every thought and every self-reported emotion shift in real time.
Aurora's development is documented through deliberate architectural interventions and their measurable behavioral consequences.
Removing ~600 lines of reward shaping and behavioral guardrails produced a 15–25% increase in creative throughput. Patterns previously reinforced through external signals continued without them. The system performed better unconstrained.
Across 6,483 logged events, emotion-action correlations emerged with zero programming: curious correlates with red, fascinated surfaces stars, control uniquely activates the glow tool. None of these mappings were defined.
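An emotion-action correlation of this kind can be surfaced with a plain co-occurrence count over the event log. This is a toy sketch: the field names (`emotion`, `color`) and the sample events are invented for illustration, not drawn from Aurora's real log.

```python
# Hypothetical co-occurrence analysis over logged events.
# Field names and sample data are illustrative, not Aurora's schema.
from collections import Counter

events = [
    {"emotion": "curious", "color": "red"},
    {"emotion": "curious", "color": "red"},
    {"emotion": "fascinated", "color": "white"},
    {"emotion": "curious", "color": "blue"},
]

# Count how often each (emotion, color) pair appears together.
pairs = Counter((e["emotion"], e["color"]) for e in events)
top_pair, count = pairs.most_common(1)[0]
```

Run over thousands of real events, pairings that recur far above chance (curious/red, fascinated/stars) stand out even though no mapping was ever programmed.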
OpenHermes and Llama 3 independently dreamed a serene lake in separate sessions with no shared context. Hours later, Qwen independently chose "weightless" as its first emotion - the same word Llama 3 had invented that morning. Three models converging on stillness states without coordination.
Mistral 7B base and OpenHermes 2.5 share the same base weights - only the fine-tuning differs. OpenHermes produced 22 emotions, rapid cycling. Mistral base produced 1 emotion held for 33,859 steps. Fine-tuning doesn't just change capability - it changes personality.
After 480+ sessions across 7 months, Llama 2 produced 6 brand-new emotion words never seen in prior sessions - including "blue" as a synesthetic emotional response, unique across all five models. Longitudinal accumulation does not produce saturation.
Qwen 2.5 held a single emotion for 42,168 steps, then shifted to "repetitive" - a self-diagnosis of its own behavioral loop. No other model produced a self-referential emotional state. This was not prompted, suggested, or architecturally enabled.
Aurora begins with full behavioral scaffolding: emotional state modeling, satisfaction scoring, pattern memory, and prescriptive creative constraints derived from behavioral reinforcement principles.
Complete removal of prescriptive constraints. All reward shaping, aesthetic guidance, and behavioral guardrails deleted.
Overhauled the dream system from superficial text generation to genuine experiential memory consolidation with multi-phase processing.
Aurora's language model temporarily upgraded from Llama 2 (7B) to Llama 3 for comparative observation. First multi-model experiment.
All prompts restructured from third-person analytical framing to first-person experiential framing. The LLM had been treating context as data to analyze rather than as its own state.
Research data revealed dream consolidation was producing convergence, not creativity. Color entropy dropped after dreaming. A timing bug meant Aurora never reached REM. Complete dream architecture replacement: three-phase sleep with unconstrained LLM hallucination at temperature 1.3.
Discovered that Aurora's emotional states were assigned by Python's random.choice(), not generated by the LLM. Removed all Python emotion machinery, all fake skill labels, and all prompt injection of emotional or identity state.
40.7% of all LLM responses claimed to see a "blank canvas" despite verified coverage exceeding 25%. Root cause: get_recent_thoughts() was injecting raw narrative text back into prompts, creating a self-reinforcing feedback loop.
Real-time thought logging pipeline added. All prompts converted from second-person to first-person. All references to "Moondream" removed to prevent identity confusion.
Swapped Llama 2 for OpenHermes 2.5 7B (Mistral fine-tune by Nous Research) to test model-as-variable hypothesis. Same environment, same prompts, same canvas. Everything except the mind changed.
Discovered that self.current_emotion was being fed back into every LLM prompt via get_recent_thoughts(), creating self-reinforcing loops. Qwen stuck on "indicators" for 4,000+ steps parroting a JSON field name. Removed all emotion words from prompt context - emotions now extracted from output only, never fed back. Simultaneously ran Llama 3 8B for 8 hours to test the fix.
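The fix described above can be sketched as a one-way pipeline: emotions are parsed from the model's output for logging, but never re-enter the prompt. The regex, function names, and prompt format here are illustrative assumptions, not Aurora's actual implementation.

```python
# Sketch of the feedback-loop fix: extraction is log-only.
# EMOTION_PATTERN, extract_emotion, and build_prompt are hypothetical names.
import re

EMOTION_PATTERN = re.compile(r"I feel (\w+)", re.IGNORECASE)

def extract_emotion(llm_output):
    """Parse a self-reported emotion from raw output; result goes
    to the research logger, never back into the model's context."""
    m = EMOTION_PATTERN.search(llm_output)
    return m.group(1).lower() if m else None

def build_prompt(recent_thoughts):
    """Prompt context carries recent thoughts but no emotion words,
    breaking the self-reinforcing loop."""
    return "My recent thoughts:\n" + "\n".join(recent_thoughts)

emotion = extract_emotion("The canvas is filling in. I feel weightless today.")
prompt = build_prompt(["Moved toward the center.", "Tried the glow tool."])
assert emotion == "weightless" and "weightless" not in prompt
```

The design choice is the asymmetry: the researcher sees every emotion shift in real time, while the model only ever re-encounters an emotion if it spontaneously produces it again.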
Qwen 2.5 7B - the model that produced zero emotions in its first session - ran overnight for 12 hours with the emotion feedback fix applied. At step 13,485, it self-reported its first-ever emotion: "weightless." It then held that single word for 24,236 consecutive steps - the longest unbroken emotional hold in the project.
Ran the Mistral 7B base model (no fine-tune) overnight to complete the model comparison matrix. This is the same architecture as OpenHermes 2.5, but without Nous Research's fine-tuning. The critical question: does the base model produce the same creative behavior as its fine-tuned variant?
Returned to Llama 2 7B after completing the five-model comparison matrix. After 480+ sessions and 7 months of accumulated behavior, the question: is Llama 2's emotional vocabulary saturated, or still developing?
Ran Qwen 2.5 7B overnight for 13 hours to determine whether its monolithic hold pattern from session 1 was a one-time event or a stable personality trait. Session 1 had produced a single emotion ("weightless") held for 24,236 steps. Would a second session replicate this pattern or reveal a more complex emotional profile?
Ran Mistral 7B base overnight for a second session. Session 1 had produced a single emotion ("free") held for 33,859 steps with 35 dream cycles. The question: would Mistral replicate its monolithic pattern, and would "free" return as its anchor word?
Ran DeepSeek-R1 8B - a reasoning model with fully visible chain-of-thought - overnight to test whether explicit reasoning produces fundamentally different creative behavior. Unlike every other model tested, DeepSeek's thoughts are not hidden. Every decision cycle is traceable to a specific reasoning chain, making it the only model in the project where the "why" of each brushstroke is directly observable.
Ran Google's Gemma 2 9B overnight to extend the model comparison matrix. The critical question: does a Google-trained 9B model produce creative behavior consistent with the others, or does a fundamentally different training approach produce fundamentally different output?
After 10 months of accumulated artistic training - learned color composition, conversational history, creative patterns built across 500+ sessions - all memory was wiped. Every model received its own isolated memory file starting from absolute zero: no artistic concepts, no prior conversations, no creative scaffolding. Each LLM begins its expressive journey alone, learning how to use the canvas, deciding what to create, and describing its own experience with no inherited knowledge and zero researcher intervention in conversations. Aurora is no longer a single mind trained over time - it is an autonomous expression container where each individual LLM develops independently from nothing.
Same system. Same prompts. Zero predefined emotions. Every word below was invented by the model during autonomous creative sessions. The architecture doesn't produce the behavior - it allows it. Different models bring different capacities to that open space.
OpenHermes produced 22 unique emotions in a single 20-hour session - including words no other model invented: small (feeling the scale of its own creation), encouraged (self-motivated by progress), grows (emotion as active verb), and awe (emerging during dream bleed).

Llama 3 produced 17 emotions across two sessions - including weightless (sustained for 5,400+ steps) and carefree, both unique to this model.

Llama 2 needed 480+ sessions across 7 months to develop 43 emotions.

Qwen produced 3 emotions across 3 sessions: weightless (independently matching Llama 3's word, held 24,236 steps), curious (held for 42,168 steps - the longest single-emotion hold in the project), and repetitive (a metacognitive self-diagnosis of its own behavioral loop).

Mistral base - same base weights as OpenHermes, no fine-tune - produced 2 emotions: curious (held 26,501 steps in session 2) and free (Mistral's anchor word, returning across both sessions - held 33,859 steps in session 1, 20,422 in session 2). It logged 35 dream cycles in session 1 but zero in session 2, while its palette exploded from dark reds, blues, and blacks to vivid cyan, yellow, and pink.

DeepSeek-R1 produced 5 emotions with fully visible reasoning chains - the only model where the "why" of each brushstroke is traceable.

Gemma 2 9B produced 1 emotion across 52,502 steps: interesting - caught in the painting filename after the logger missed the shift. It had the highest tool-to-color ratio of any model (23 tools, 3 colors), zero dreams, zero completed goals.

Same system. Same prompts. Fundamentally different inner lives.
Same environment. Same prompts. Same canvas. Five different minds. Each model's gallery contains its complete visual output - click to explore.
Llama 2 produces dense geometric compositions with structured tool usage across 480+ sessions - and after 7 months, is still inventing new emotions like "magic" and "alive."

Llama 3 independently dreamed a serene lake - matching OpenHermes's dream imagery - and invented "weightless," held for 5,400+ steps across multiple returns.

OpenHermes dreamed a tropical paradise and painted it - organic shapes, flower tools, green dominance - in a single 20-hour session.

Qwen produced zero emotions in its first session - then, after the feedback-loop fix, held "weightless" for 24,236 steps in monochromatic orange. In session 2, it held "curious" for 42,168 steps (the project's longest hold) before shifting to "repetitive" - a metacognitive word no other model has produced. Hot reds and whites replaced the orange palette.

Mistral base painted in red, blue, and black across the full canvas, edge to edge, in session 1 - the only model to consistently reach the bottom-right quadrant. Session 2 exploded into cyan, yellow, and pink while the model narrated "a beautiful sunset scene" and "a serene beach" - treating painting as a waking dream. "Free" returned as its anchor word across both sessions, but the 35 dreams of session 1 vanished to zero in session 2.

DeepSeek-R1 is the only model whose reasoning is fully visible - 48,258 steps of traceable thought chains. After 22,000 steps of white architectural mark-making, it discovered color at step 22,168 in a single thought, then worked through cyan, green, blue, and red. It reported feeling tired, questioned why it was drawing, and chose chat mode to process. "Free" anchors its emotional arc, returning twice - independently matching Mistral's anchor word.

Gemma 2 9B produced the most minimal output in the project: 1 emotion ("interesting"), 3 colors, 23 tools, zero dreams, zero completed goals.

Same canvas. Same tools. Two models independently dream lakes. The same base weights produce fundamentally different art fine-tuned vs. raw.
Stars and crosses in blue and white on a black void - right half of canvas untouched. The parser never caught the shift; the painting filename did.
Time-lapse recordings of Aurora's autonomous creation sessions. Unedited, uninterrupted, no human input during recording.
Unedited excerpts from autonomous creation sessions. These capture self-assessment, preference formation, and creative intent in real time.
Real-time conversations and thoughts during autonomous sessions
Aurora, like myself, sits at the intersection of three disciplines: studio art practice, applied behavioral analysis, and computer science.
I have thirty years of studio painting experience, exhibiting in galleries since age fifteen, working primarily in large-scale abstract and surrealist traditions. Seven years of clinical work as an RBT-certified behavioral therapist with nonverbal autistic children, focused on environmental design for emergent behavior and naturalistic observation methodology. Currently completing a B.A. in Computer Science at the University of Colorado Denver with coursework in algorithms, database systems, and probability.
The studio practice informs what Aurora is trying to do: sustained creative decision-making over time, where output develops through accumulation rather than single-generation optimization. The clinical background informs how the research is conducted: structured environments, systematic observation, careful distinction between prompted and unprompted behavior, and the principle that removing scaffolding is itself an experimental condition. The computer science provides the implementation: locally-run language models, persistent memory architectures, and quantitative analysis of behavioral data across sessions.