
Lessons from a Year of AI Tool Experiments: What Actually Worked
- 9 minutes - Feb 8, 2026
- #ai #productivity #tools #retrospective #lessons-learned
Over the past year, I’ve been experimenting extensively with AI tools—trying to understand what they’re actually good for, where they fall short, and how to use them effectively. I’ve written about several of these experiments: the meeting scheduling failures, the presentation generation disappointments, and most recently, setting up Moltbot as an SDR.
Looking back at all these experiments, patterns emerge. Some things consistently worked. Others consistently didn’t. And a few things surprised me in both directions.
This post is a retrospective on those experiments—a synthesis of what I’ve learned about AI tools through hands-on experience rather than hype.
Pattern 1: AI Is Great at Planning, Weak at Execution
This is the most consistent pattern I’ve observed. AI tools excel at the thinking and planning phases of work but struggle with execution and implementation.
Where this showed up:
When I tried using AI for presentation creation, ChatGPT was excellent at synthesizing information, creating outlines, generating speaker notes, and suggesting visual approaches. But it couldn’t actually create the Google Slides presentation. The planning was valuable; the execution was nonexistent.
Similarly, for meeting scheduling, AI assistants could discuss scheduling strategies and suggest optimal times, but they couldn’t actually access calendars and book meetings.
For sales automation with Moltbot, the AI could research prospects, draft messages, and plan outreach sequences—but implementing those plans required significant human configuration and oversight.
Why this happens:
Planning is fundamentally a text-generation task. AI models are trained to produce coherent text that sounds like good planning. Execution requires interacting with external systems, handling edge cases, and dealing with real-world messiness that doesn’t appear in training data.
How to use this pattern:
Lean into AI for ideation, planning, and strategy work. Use it to think through problems, generate options, and draft approaches. But don’t expect it to handle the execution—that’s still your job.
Pattern 2: Integration Promises Often Don’t Deliver
A recurring disappointment has been AI tools that promise deep integration with other systems but don’t deliver meaningful capability.
Where this showed up:
Google’s Gemini should have been excellent for Google Slides and Google Calendar integration—it’s made by the same company, after all. But in practice, Gemini couldn’t create slides or schedule meetings any better than ChatGPT could. The “integration” was mostly marketing, not meaningful capability.
Even when integrations exist, they’re often shallow. AI tools might be able to read from a system but not write to it, or they might handle simple cases but fail on anything complex.
Why this happens:
True integration requires not just API access but understanding of the target system’s constraints, permissions, and edge cases. Most AI integrations are bolted on after the fact rather than deeply designed, resulting in capabilities that work in demos but fail in practice.
How to use this pattern:
Be skeptical of integration claims until you test them yourself. When evaluating AI tools, try your actual use cases rather than trusting marketing materials. The gap between advertised capability and real capability is often significant.
Pattern 3: Human Oversight Remains Essential
Across every AI experiment I’ve run, human oversight has remained essential. There’s no AI tool I’ve used where I felt comfortable letting it operate fully autonomously.
Where this showed up:
With Moltbot as an SDR, I recommended starting in “supervised mode” where every outbound message requires approval. The AI can draft and suggest, but a human needs to verify before anything reaches a prospect.
For AI-generated code, review is more important than ever—the code looks plausible but can be subtly wrong in ways that are hard to catch.
Even for low-stakes tasks like drafting blog posts or emails, I find myself reviewing and editing AI output rather than using it directly.
Why this happens:
AI tools optimize for plausibility, not correctness. They generate output that looks right, sounds right, and follows patterns from their training data. But “looks right” and “is right” aren’t the same thing, especially for tasks where accuracy matters.
How to use this pattern:
Build oversight into your AI workflows from the start. Don’t treat human review as overhead to be minimized; treat it as an essential quality gate. The value of AI tools is amplifying human capability, not replacing human judgment.
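That approval gate can be made concrete in code. Here's a minimal sketch of a "supervised mode" outbox like the one described above: the AI can only propose drafts, and nothing is sent until a human approves it. The class and field names are my own illustration, not Moltbot's actual API.

```python
from dataclasses import dataclass

@dataclass
class Draft:
    prospect: str
    body: str
    approved: bool = False

class SupervisedOutbox:
    """Hypothetical supervised-mode gate: AI drafts queue up; only a
    human approval moves a message from the queue to the send path."""

    def __init__(self):
        self.queue = []   # drafts awaiting human review
        self.sent = []    # drafts that were approved and sent

    def propose(self, prospect, body):
        """Called by the AI side: queues a draft, sends nothing."""
        draft = Draft(prospect, body)
        self.queue.append(draft)
        return draft

    def approve_and_send(self, draft, send_fn):
        """Called by the human side: marks the draft approved and
        hands it to the real delivery function."""
        draft.approved = True
        self.queue.remove(draft)
        self.sent.append(draft)
        send_fn(draft)
```

The design point is that the send function is only reachable through the human-facing method, so "fully autonomous" is structurally impossible rather than a policy you hope gets followed.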
Pattern 4: Context Is Everything
The same AI tool can be incredibly useful or completely useless depending on how much context you provide and how well that context matches what the AI needs.
Where this showed up:
Generic prompts produce generic outputs. When I asked AI to “write a presentation about our product,” I got bland marketing-speak. When I provided detailed context about the audience, the specific points to emphasize, and examples of the tone I wanted, the output was dramatically better.
For coding tasks, AI performs much better when it can see the relevant parts of your codebase, understand your conventions, and access your documentation. Without that context, it generates code that might work in isolation but doesn’t fit your system.
Why this happens:
AI models work by pattern-matching against their training data. The more specific context you provide, the more precisely they can match patterns. Vague inputs trigger vague patterns; specific inputs trigger specific patterns.
How to use this pattern:
Invest time in crafting context. Don’t just describe what you want—describe why you want it, who it’s for, what constraints apply, and what good looks like. The upfront investment in context pays off in output quality.
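One way I've found to make that investment routine is to template it. The sketch below assembles a prompt from the ingredients listed above (the what, the why, the audience, the constraints, and an example of "good"); the field names are illustrative, not any tool's required format.

```python
def build_prompt(task, audience, purpose, constraints, good_example):
    """Assemble a context-rich prompt from the pieces that generic
    prompts usually omit. All field names are my own convention."""
    return "\n".join([
        f"Task: {task}",
        f"Audience: {audience}",
        f"Why it matters: {purpose}",
        "Constraints: " + "; ".join(constraints),
        f"Example of the tone and quality wanted:\n{good_example}",
    ])

prompt = build_prompt(
    task="Write a 10-slide outline pitching our product",
    audience="Engineering managers evaluating vendors",
    purpose="Get a follow-up technical deep-dive booked",
    constraints=["no marketing-speak", "one claim per slide", "cite a metric where possible"],
    good_example="Slide 3: 'Deploys dropped from 45 min to 6 min (internal benchmark).'",
)
```

The template forces you to answer "who is this for and what does good look like" before you hit enter, which is exactly where generic prompts fall down.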
Pattern 5: The Cost-Benefit Calculation Isn’t Obvious
For some tasks, AI tools provide massive time savings. For others, the time spent prompting, reviewing, and fixing AI output exceeds the time it would have taken to just do the work manually.
Where this showed up:
For boilerplate code generation—CRUD operations, standard API endpoints, test scaffolding—AI tools are genuinely faster. The time saved is real and significant.
For complex, nuanced work—architectural decisions, security-sensitive code, tasks requiring deep domain knowledge—AI tools often slow me down. I spend time explaining context, reviewing questionable output, and fixing subtle issues.
For tasks in the middle—moderately complex, somewhat repetitive—the calculation is closer to break-even. The AI helps with parts of the task but not others.
Why this happens:
AI tools have a fixed overhead: crafting prompts, evaluating outputs, iterating on problems. For simple tasks, this overhead is small relative to the time saved. For complex tasks, the overhead can exceed the value.
How to use this pattern:
Be selective about which tasks you use AI for. Don’t try to use AI for everything just because it’s available. Evaluate whether the tool actually saves time for your specific task, and be willing to work manually when that’s more efficient.
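The break-even math here is simple enough to write down. A rough sketch, with made-up but representative numbers: add up the AI path's overhead (prompting, reviewing, fixing) plus the remaining assisted work, and compare it to just doing the task by hand.

```python
def ai_is_worth_it(prompt_min, review_min, fix_min, assisted_work_min, manual_min):
    """Rough break-even check: total time on the AI path vs. the
    estimated time to do the task manually. Returns True when the
    AI path is faster. All inputs are minutes."""
    ai_total = prompt_min + review_min + fix_min + assisted_work_min
    return ai_total < manual_min

# Boilerplate CRUD endpoint: small overhead, big manual cost.
boilerplate = ai_is_worth_it(prompt_min=5, review_min=10, fix_min=5,
                             assisted_work_min=5, manual_min=60)

# Security-sensitive change: heavy review and fixing dominate.
security = ai_is_worth_it(prompt_min=20, review_min=30, fix_min=40,
                          assisted_work_min=30, manual_min=90)
```

The numbers are invented, but the shape matches my experience: the fixed overhead barely moves, so the verdict flips as review and fix time grow with task complexity.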
Pattern 6: Failure Modes Are Different
When AI tools fail, they fail differently than humans. Understanding these failure modes is essential for using AI effectively.
Where this showed up:
Human failures tend to be obviously wrong—typos, missing cases, unclear logic. AI failures tend to be plausibly wrong—code that looks correct but isn’t, suggestions that sound reasonable but miss the point, outputs that follow the wrong pattern confidently.
This makes AI failures harder to catch. Human errors often look like errors; AI errors often look like correct work until you dig deeper.
Why this happens:
AI models are trained to produce plausible output, not correct output. Plausibility and correctness often overlap, but not always. When they diverge, you get confident-sounding wrong answers.
How to use this pattern:
Adjust your review approach for AI-generated work. Don’t just skim for obvious errors—verify that the work actually accomplishes what it should. Be especially skeptical of output that looks polished; polish doesn’t guarantee correctness.
A Framework for Evaluating AI Tools
Based on these patterns, here’s how I now evaluate whether an AI tool is worth using for a particular task:
Question 1: Is this a planning/ideation task or an execution task?
If planning/ideation: AI probably helps. If execution: AI might help with parts, but expect to do significant work yourself.
Question 2: How much does the AI’s integration with other systems matter?
If the task is self-contained text generation: integration doesn’t matter. If the task requires interacting with external systems: test the integration carefully before relying on it.
Question 3: What’s the cost of errors?
Low cost (draft blog post, exploratory code): Let AI help freely, fix issues as you find them. Medium cost (customer-facing content, production code): Use AI, but review everything carefully. High cost (security-critical, legally sensitive, high-stakes decisions): Use AI very cautiously or not at all.
Question 4: How much context does the task require?
Minimal context (general knowledge tasks): AI works well out of the box. Moderate context (your specific domain): Invest in providing context. Extensive context (deep institutional knowledge): AI may struggle even with good prompting.
Question 5: Is the time investment justified?
Quick task: Just do it manually if AI overhead exceeds task time. Moderate task: Worth trying AI, but evaluate honestly. Large task: AI acceleration is more valuable, worth investing in setup and context.
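The five questions above can be collapsed into a quick scoring checklist. This is a sketch of how I'd encode it; the weights are my own rough calibration, not anything rigorous, and you'd tune them to your own experience.

```python
def evaluate_ai_fit(answers):
    """Score a task against the five-question framework. `answers` is a
    dict with the keys shown below; weights are illustrative only."""
    score = 0
    # Q1: planning/ideation tasks are where AI shines.
    if answers["planning_task"]:
        score += 2
    # Q2: untested integration claims are a liability until proven.
    if answers["untested_integration"]:
        score -= 2
    # Q3: the higher the cost of errors, the more cautious to be.
    score -= {"low": 0, "medium": 1, "high": 3}[answers["error_cost"]]
    # Q4: deep institutional context is hard to convey in a prompt.
    score -= {"minimal": 0, "moderate": 1, "extensive": 2}[answers["context_needed"]]
    # Q5: overhead must be small relative to the task itself.
    score += 1 if answers["task_minutes"] > answers["overhead_minutes"] else -1
    return "worth trying AI" if score > 0 else "probably faster manually"

verdict = evaluate_ai_fit({
    "planning_task": True,
    "untested_integration": False,
    "error_cost": "low",
    "context_needed": "minimal",
    "task_minutes": 60,
    "overhead_minutes": 10,
})
```

Even as a back-of-the-envelope tool, writing the questions down as a checklist keeps me from reaching for AI by default on tasks where it predictably loses.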
What I’m Optimistic About
Despite the limitations, I’m genuinely optimistic about AI tools. They’re improving rapidly, and even current capabilities provide real value when used appropriately.
I’m optimistic about:
- Acceleration of mechanical work: Boilerplate, scaffolding, and repetitive tasks are genuinely faster with AI.
- Learning and exploration: AI is excellent for quickly understanding unfamiliar domains, APIs, and technologies.
- First drafts: Starting from AI output and editing is often faster than starting from scratch.
- Code review assistance: AI can catch certain classes of issues and flag code that deserves closer inspection.
What I’m Skeptical About
I remain skeptical about:
- Fully autonomous AI work: For anything that matters, human oversight remains essential.
- Deep integration promises: Most integrations are shallower than marketed.
- Universal productivity gains: AI helps with some tasks and hurts with others; the net effect is context-dependent.
- Replacing human judgment: For decisions that require nuance, context, and accountability, humans are still in charge.
Conclusion
A year of AI experiments has left me with a nuanced view: these tools are genuinely useful, but not in the ways that hype suggests. They’re thinking partners, not autonomous workers. They’re accelerants for certain tasks, not universal productivity boosters. They’re powerful when used well and frustrating when used poorly.
The key insight is that AI tools change what work looks like without eliminating work. You spend less time typing and more time directing, reviewing, and refining. Whether that’s a good trade depends on the task, your skill level, and how well you’ve learned to work with AI.
For anyone experimenting with AI tools, my advice is: stay curious, stay skeptical, and measure results rather than trusting promises. The tools that actually help your specific work might not be the ones getting the most attention. And the ways they help might be different from what you expected.
The future of AI-assisted work isn’t about tools that do your job for you. It’s about tools that help you do your job better—if you figure out how to use them effectively. That “if” is doing a lot of work, and figuring it out is the real skill to develop.