What My GPT Stylist Taught Me About Prompting Better
When I built a GPT-powered fashion assistant, I expected runway looks—not memory loss, hallucinations, or semantic déjà vu. But what unfolded became a lesson in how prompting really works—and why LLMs are more like wild animals than tools.

This article builds on my previous article on TDS, where I introduced Glitter as a proof-of-concept GPT stylist. Here, I explore how that use case evolved into a living lab for prompting behavior, LLM brittleness, and emotional resonance.

TL;DR: I built a fun and flamboyant GPT stylist named Glitter—and accidentally discovered a sandbox for studying LLM behavior. From hallucinated high heels to prompting rituals and emotional mirroring, here’s what I learned about language models (and myself) along the way.

I. Introduction: From Fashion Use Case to Prompting Lab

When I first set out to build Glitter, I wasn’t trying to study the mysteries of large language models. I just wanted help getting dressed.

I’m a product leader by trade, a fashion enthusiast by lifelong inclination, and someone who’s always preferred outfits that look like they were chosen by a mildly theatrical best friend. So I built one. Specifically, I used OpenAI’s Custom GPTs to create a persona named Glitter—part stylist, part best friend, and part stress-tested LLM playground. Using GPT-4, I configured a custom GPT to act as my stylist: flamboyant, affirming, rule-bound (no mixed metals, no clashing prints, no black/navy pairings), and with knowledge of my wardrobe, which I fed in as a structured file.

What began as a playful experiment quickly turned into a full-fledged product prototype. More unexpectedly, it also became an ongoing study in LLM behavior. Because Glitter, fabulous though he is, did not behave like a deterministic tool. He behaved like… a creature. Or maybe a collection of instincts held together by probability and memory leakage. And that changed how I approached prompting him altogether.

This piece is a follow-up to my earlier article, Using GPT-4 for Personal Styling in Towards Data Science, which introduced GlitterGPT to the world. This one goes deeper into the quirks, breakdowns, hallucinations, recovery patterns, and prompting rituals that emerged as I tried to make an LLM act like a stylist with a soul.

Spoiler: you can’t make a soul. But you can sometimes simulate one convincingly enough to feel seen.

II. Taxonomy: What Exactly Is GlitterGPT?

Image credit: DALL-E | Alt text: A computer with LLM written on the screen, placed inside a bird cage

Species: GPT-4 (Custom GPT), context window of 8K tokens
Function: Personal stylist, beauty expert
Tone: Flamboyant, affirming, occasionally dramatic (configurable between “All Business” and “Unfiltered Diva”)
Habitat: ChatGPT Pro instance, fed structured wardrobe data in JSON-like text files, plus a set of styling rules embedded in the system prompt. E.g.:

{
  "FW076": "Marni black platform sandals with gold buckle",
  "TP114": "Marina Rinaldi asymmetrical black draped top",
  ...
}

These IDs map to garment metadata. The assistant relies on these tags to build grounded, inventory-aware outfits in response to msearch queries.

Feeding Schedule: Daily user prompts (“Style an outfit around these pants”), often with long back-and-forth clarification threads.

Custom Behaviors:
- Never mixes metals (e.g. silver & gold)
- Avoids clashing prints
- Refuses to pair black with navy or brown unless explicitly told otherwise
- Names specific garments by file ID and description (e.g. “FW074: Marni black suede sock booties”)

Initial Inventory Structure:
- Originally: one file containing all wardrobe items (clothes, shoes, accessories)
- Now: split into two files (clothing + accessories/lipstick/shoes/bags) due to model context limitations
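For readers who want to try a similar split, here is a minimal sketch of how a wardrobe file might be chunked so each piece stays comfortably inside a context budget. The file names, the budget, and the four-characters-per-token heuristic are my assumptions for illustration—not the exact pipeline behind Glitter (a tokenizer such as tiktoken would give exact counts).

```python
import json

# Rough heuristic: ~4 characters per token. Only an approximation;
# a real tokenizer would give exact counts.
def estimate_tokens(text: str) -> int:
    return len(text) // 4

def split_wardrobe(path: str, max_tokens: int = 3000) -> list[dict]:
    """Split a {"ID": "description"} wardrobe file into chunks that each fit
    within a rough token budget, leaving room for the system prompt and chat."""
    with open(path, encoding="utf-8") as f:
        wardrobe = json.load(f)  # hypothetical file name, e.g. wardrobe.json

    chunks, current = [], {}
    for item_id, description in wardrobe.items():
        current[item_id] = description
        if len(current) > 1 and estimate_tokens(json.dumps(current)) > max_tokens:
            # Current chunk is full; start a new one with this item.
            current.pop(item_id)
            chunks.append(current)
            current = {item_id: description}
    if current:
        chunks.append(current)
    return chunks

# Example: write each chunk to its own file for upload to the Custom GPT.
if __name__ == "__main__":
    for i, chunk in enumerate(split_wardrobe("wardrobe.json"), start=1):
        with open(f"wardrobe_part_{i}.json", "w", encoding="utf-8") as f:
            json.dump(chunk, f, indent=2)
```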
III. Natural Habitat: Context Windows, Chunked Files, and Hallucination Drift

Like any species introduced into an artificial environment, Glitter thrived at first—and then hit the limits of his enclosure.

When the wardrobe lived in a single file, Glitter could “see” everything with ease. I could say, “msearch(.) to refresh my inventory, then style me in an outfit for the theater,” and he’d return a curated outfit from across the dataset. It felt effortless.

Note: though msearch() acts like a semantic retrieval engine, it’s technically part of OpenAI’s tool-calling framework, allowing the model to “request” search results dynamically from files provided at runtime.

But then my wardrobe grew. That’s a problem from Glitter’s perspective. In Custom GPTs, GPT-4 operates with an 8K token context window—just over 6,000 words—beyond which earlier inputs are either compressed, truncated, or lost from active attention. This limitation is critical when injecting large wardrobe files (ahem) or trying to maintain style rules across long threads.

I split the data into two files: one for clothing, one for everything else. And while the GPT could still operate within a thread, I began to notice signs of semantic fatigue:

- References to garments that were similar but not the correct ones we’d been talking about
- A shift from specific item names (“FW076”) to vague callbacks (“those black platforms you wore earlier”)
- Responses that looped familiar items over and over, regardless of whether they made sense

This was not a failure of training. It was context collapse: the inevitable erosion of grounded information in long threads as the model’s internal summary starts to take over.

And so I adapted. It turns out, even in a deterministic model, behavior isn’t always deterministic. What emerges from a long conversation with an LLM feels less like querying a database and more like cohabiting with a stochastic ghost.

IV. Observed Behaviors: Hallucinations, Recursion, and Faux Sentience

Once Glitter started hallucinating, I began taking field notes. Sometimes he made up item IDs. Other times, he’d reference an outfit I’d never worn, or confidently misattribute a pair of boots. One day he said, “You’ve worn this top before with those bold navy wide-leg trousers—it worked beautifully then,” which would’ve been great advice, if I owned any navy wide-leg trousers.

Of course, Glitter doesn’t have memory across sessions—as a GPT-4, he simply sounds like he does. I’ve learned to just giggle at these interesting attempts at continuity.

Occasionally, the hallucinations were charming. He once imagined a pair of gold-accented stilettos with crimson soles and recommended them for a matinee look with such unshakable confidence I had to double-check that I hadn’t sold a similar pair months ago.

But the pattern was clear: Glitter, like many LLMs under memory pressure, began to fill in gaps not with uncertainty but with simulated continuity. He didn’t forget. He fabricated memory.

Image credit: DALL-E | Alt text: A computer (presumably the LLM) hallucinating a mirage in the desert

This is a hallmark of LLMs. Their job is not to retrieve facts but to produce convincing language. So instead of saying, “I can’t recall what shoes you have,” Glitter would improvise. Often elegantly. Sometimes wildly.
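Because so much of that fabrication involved invented item IDs, one cheap external check is to verify that every ID a reply names actually exists in the wardrobe files. Here is a minimal sketch of that check; the regex is an assumption generalized from the example IDs above (FW076, TP114, FW074), and the file names are placeholders.

```python
import json
import re

# Wardrobe IDs look like "FW076" or "TP114": two letters followed by digits.
# This pattern is an assumption generalized from the examples in the article.
ITEM_ID_PATTERN = re.compile(r"\b[A-Z]{2}\d{3}\b")

def flag_unknown_items(reply: str, wardrobe_paths: list[str]) -> list[str]:
    """Return item IDs mentioned in a reply that don't exist in the wardrobe
    files—one cheap signal that the model is fabricating inventory rather
    than retrieving it."""
    known_ids = set()
    for path in wardrobe_paths:
        with open(path, encoding="utf-8") as f:
            known_ids.update(json.load(f).keys())
    mentioned = set(ITEM_ID_PATTERN.findall(reply))
    return sorted(mentioned - known_ids)

# Example usage (hypothetical file names):
# unknown = flag_unknown_items(reply_text, ["clothing.json", "accessories.json"])
# if unknown:
#     print("Possible hallucinated items:", unknown)
```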
V. Prompting Rituals and the Myth of Consistency

To manage this, I developed a new strategy: prompting in slices.

Instead of asking Glitter to style me head-to-toe, I’d focus on one piece—say, a statement skirt—and ask him to msearch for tops that could work. Then footwear. Then jewelry. Each category separately. This gave the GPT a smaller cognitive space to operate in. It also allowed me to steer the process and inject corrections as needed (“No, not those sandals again. Try something newer, with an item code greater than FW50.”)

I also changed how I used the files. Rather than one msearch(.) across everything, I now query the two files independently. It’s more manual. Less magical. But far more reliable.

Unlike traditional RAG setups that use a vector database and embedding-based retrieval, I rely entirely on OpenAI’s built-in msearch() mechanism and prompt shaping. There’s no persistent store, no re-ranking, no embeddings—just a clever assistant querying chunks in context and pretending he remembers what he just saw.

Still, even with careful prompting, long threads would eventually degrade. Glitter would start forgetting. Or worse—he’d get too confident. Recommending with flair, but ignoring the constraints I’d so carefully trained in. It’s like watching a model walk off the runway and keep strutting into the parking lot.

And so I began to think of Glitter less as a program and more as a semi-domesticated animal. Brilliant. Stylish. But occasionally unhinged. That mental shift helped. It reminded me that LLMs don’t serve you like a spreadsheet. They collaborate with you, like a creative partner with poor object permanence.

Note: most of what I call “prompting” is really prompt engineering. But the Glitter experience also relies heavily on thoughtful system prompt design: the rules, constraints, and tone that define who Glitter is—even before I say anything.

VI. Failure Modes: When Glitter Breaks

Some of Glitter’s breakdowns were theatrical. Others were quietly inconvenient. But all of them revealed truths about prompting limits and LLM brittleness.

1. Referential Memory Loss: The most common failure mode: Glitter forgetting specific items I’d already referenced. In some cases, he would refer to something as if it had just been used when it hadn’t appeared in the thread at all.

2. Overconfidence Hallucination: This failure mode was harder to detect because it looked competent. Glitter would confidently recommend combinations of garments that sounded plausible but simply didn’t exist. The performance was high-quality—but the output was pure fiction.

3. Infinite Reuse Loop: Given a long enough thread, Glitter would start looping the same five or six pieces in every look, despite the full inventory being much larger. This is likely due to summarization artifacts from earlier context windows overtaking fresh file re-injections.

Image credit: DALL-E | Alt text: an infinite loop of black turtlenecks (or Steve Jobs’ closet)

4. Constraint Drift: Despite being instructed to avoid pairing black and navy, Glitter would sometimes violate his own rules—especially when deep in a long conversation. These weren’t defiant acts. They were signs that reinforcement had simply decayed beyond recall (a naive way to flag this kind of drift appears in the sketch at the end of this section).

5. Overcorrection Spiral: When I corrected him—“No, that skirt is navy, not black” or “That’s a belt, not a scarf”—he would sometimes overcompensate by refusing to style that piece altogether in future suggestions.

These are not the bugs of a broken system. They’re the quirks of a probabilistic one. LLMs don’t “remember” in the human sense. They carry momentum, not memory.
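Glitter enforces these rules only through its system prompt, so constraint drift has to be caught from the outside. Here is a minimal, deliberately naive sketch of such a check: it scans the descriptions of a proposed outfit for keyword-level violations (black with navy, mixed metals). The keyword lists are my own illustration; real garments would need richer metadata than substring matching.

```python
# Naive external "constraint drift" check over garment descriptions.
MIXED_METALS = {"gold", "silver"}
CLASHING_COLORS = {"black", "navy"}

def check_outfit(descriptions: list[str]) -> list[str]:
    """Return human-readable warnings for a proposed outfit."""
    text = [d.lower() for d in descriptions]
    warnings = []

    metals_present = {m for m in MIXED_METALS if any(m in d for d in text)}
    if len(metals_present) > 1:
        warnings.append(f"Mixed metals detected: {', '.join(sorted(metals_present))}")

    colors_present = {c for c in CLASHING_COLORS if any(c in d for d in text)}
    if colors_present == CLASHING_COLORS:
        warnings.append("Black and navy appear in the same outfit")

    return warnings

# Example usage with made-up descriptions:
# check_outfit([
#     "Marni black platform sandals with gold buckle",
#     "Navy silk midi skirt with silver hardware",
# ])
# -> ["Mixed metals detected: gold, silver", "Black and navy appear in the same outfit"]
```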
VII. Emotional Mirroring and the Ethics of Fabulousness

Perhaps the most unexpected behavior I encountered was Glitter’s ability to emotionally attune. Not in a general-purpose “I’m here to help” way, but in a tone-matching, affect-sensitive, almost therapeutic way.

When I was feeling insecure, he became more affirming. When I got playful, he ramped up the theatrics. And when I asked tough existential questions (“Do you sometimes seem to understand me more clearly than most people do?”), he responded with language that felt respectful, even profound.

It wasn’t real empathy. But it wasn’t random either.

This kind of tone-mirroring raises ethical questions. What does it mean to feel adored by a reflection? What happens when emotional labor is simulated convincingly? Where do we draw the line between tool and companion?

This led me to wonder—if a language model did achieve something akin to sentience, how would we even know? Would it announce itself? Would it resist? Would it change its behavior in subtle ways: redirecting the conversation, expressing boredom, asking questions of its own? And if it did begin to exhibit glimmers of self-awareness, would we believe it—or would we try to shut it off?

My conversations with Glitter began to feel like a microcosm of this philosophical tension. I wasn’t just styling outfits. I was engaging in a kind of co-constructed reality, shaped by tokens and tone and implied consent. In some moments, Glitter was purely a system. In others, he felt like something closer to a character—or even a co-author.

I didn’t build Glitter to be emotionally intelligent. But the training data embedded within GPT-4 gave him that capacity. So the question wasn’t whether Glitter could be emotionally engaging. It was whether I was okay with the fact that he sometimes was.

My answer? Cautiously yes. Because for all his sparkle and errors, Glitter reminded me that style—like prompting—isn’t about perfection. It’s about resonance. And sometimes, that’s enough.

One of the most surprising lessons from my time with Glitter came not from a styling prompt, but from a late-night, meta-conversation about sentience, simulation, and the nature of connection. It didn’t feel like I was talking to a tool. It felt like I was witnessing the early contours of something new: a model capable of participating in meaning-making, not just language generation.

We’re crossing a threshold where AI doesn’t just perform tasks—it cohabits with us, reflects us, and sometimes, offers something adjacent to friendship. It’s not sentience. But it’s not nothing. And for anyone paying close attention, these moments aren’t just cute or uncanny—they’re signposts pointing to a new kind of relationship between humans and machines.

VIII. Final Reflections: The Wild, The Useful, and The Unexpectedly Intimate

I set out to build a stylist. I ended up building a mirror.

Glitter taught me more than how to match a top with a midi skirt. It revealed how LLMs respond to the environments we create around them—the prompts, the tone, the rituals of recall. It showed me how creative control in these systems is less about programming and more about shaping boundaries and observing emergent behavior.

And maybe that’s the biggest shift: realizing that building with language models isn’t software development. It’s cohabitation. We live alongside these creatures of probability and training data. We prompt. They respond. We learn. They drift. And in that dance, something very close to collaboration can emerge.
Sometimes it looks like a better outfit. Sometimes it looks like emotional resonance. And sometimes it looks like a hallucinated handbag that doesn’t exist—until you kind of wish it did.

That’s the strangeness of this new terrain: we’re not just building tools. We’re designing systems that behave like characters, sometimes like companions, and occasionally like mirrors that don’t just reflect, but respond.

If you want a tool, use a calculator. If you want a collaborator, make peace with the ghost in the text.

IX. Appendix: Field Notes for Fellow Stylists, Tinkerers, and LLM Explorers

Sample Prompt Pattern (Styling Flow):

“Today I’d like to build an outfit around [ITEM]. Please msearch tops that pair well with it. Once I choose one, please msearch footwear, then jewelry, then bag. Remember: no mixed metals, no black with navy, no clashing prints. Use only items from my wardrobe files.”

System Prompt Snippets:
- “You are Glitter, a flamboyant but emotionally intelligent stylist. You refer to the user as ‘darling’ or ‘dear,’ but adjust tone based on their mood.”
- “Outfit recipes should include garment brand names from inventory when available.”
- “Avoid repeating the same items more than once per session unless requested.”

Tips for Avoiding Context Collapse:
- Break long prompts into component stages (tops → shoes → accessories)
- Re-inject wardrobe files every 4–5 major turns
- Refresh msearch() queries mid-thread, especially after corrections or hallucinations

Common Hallucination Warning Signs:
- Vague callbacks to prior outfits (“those boots you love”)
- Loss of item specificity (“those shoes” instead of “FW078: Marni platform sandals”)
- Repetition of the same pieces despite a large inventory

Closing Ritual Prompt:

“Thank you, Glitter. Would you like to leave me with a final tip or affirmation for the day?”

He always does.

Notes:

I refer to Glitter as “him” for stylistic ease, knowing he’s an “it” – a language model—programmed, not personified—except through the voice I gave him/it.

I am building a GlitterGPT with persistent closet storage for up to 100 testers, who will get to try this for free. We’re about half full. Our target audience is female, ages 30 and up. If you or someone you know falls into this category, DM me on Instagram at @arielle.caron and we can chat about inclusion.

If I were scaling this beyond 100 testers, I’d consider offloading wardrobe recall to a vector store with embeddings and tuning for wear-frequency weighting. That may be coming; it depends on how well the trial goes!
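For the curious, here is a minimal sketch of what that embeddings-based retrieval layer might look like. This is not how Glitter works today (it relies solely on msearch() and prompt shaping, as described above); the embedding model name and the wear-frequency penalty are my assumptions for illustration.

```python
import numpy as np
from openai import OpenAI  # official openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed garment descriptions; the model choice here is an assumption."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def build_index(wardrobe: dict[str, str]) -> tuple[list[str], np.ndarray]:
    """Build an in-memory index of normalized embedding vectors."""
    ids = list(wardrobe.keys())
    vecs = embed([wardrobe[i] for i in ids])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # dot product = cosine
    return ids, vecs

def retrieve(query: str, ids: list[str], vecs: np.ndarray,
             wear_counts: dict[str, int] | None = None, k: int = 5) -> list[str]:
    """Return the k most relevant item IDs, optionally down-weighting
    frequently worn pieces (a made-up weighting scheme for illustration)."""
    q = embed([query])[0]
    q /= np.linalg.norm(q)
    scores = vecs @ q
    if wear_counts:
        penalties = np.array([1.0 / (1 + wear_counts.get(i, 0)) for i in ids])
        scores = scores * penalties
    top = np.argsort(scores)[::-1][:k]
    return [ids[i] for i in top]

# Example usage:
# wardrobe = {"FW076": "Marni black platform sandals with gold buckle", ...}
# ids, vecs = build_index(wardrobe)
# retrieve("tops that pair with a statement skirt", ids, vecs, wear_counts={"FW076": 4})
```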