Conversation
I've been experimenting with having Claude finish an unfinished coding project. You've really got to watch it like a hawk. It does some inexplicable things, like removing important comments, "fixing" non-existent buffer overflows, and breaking protocol parsing. What am I missing?

@jmorris Have you tried making it write the tests for the parsing first?
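One way to apply the tests-first suggestion: pin down the parser's expected behavior in a handful of assertions before letting the model near the implementation, so any unrequested "fix" that changes parsing semantics fails loudly. The `parse_frame` function and its length-prefixed framing below are hypothetical, just to show the shape of such tests.

```python
# Hypothetical toy framing: <length byte><payload>. The point is to lock in
# parsing behavior with tests before an LLM is allowed to edit the parser.

def parse_frame(data: bytes) -> dict:
    """Parse a length-prefixed frame and reject malformed input."""
    if not data:
        raise ValueError("empty frame")
    length = data[0]
    if len(data) - 1 < length:
        raise ValueError("truncated frame")
    return {"length": length, "payload": data[1:1 + length]}

def test_valid_frame():
    assert parse_frame(b"\x03abc")["payload"] == b"abc"

def test_truncated_frame_rejected():
    try:
        parse_frame(b"\x05ab")
    except ValueError:
        pass
    else:
        raise AssertionError("truncated frame should raise")
```

With tests like these in place, you can tell the model "make the tests pass without editing the tests" and treat any diff to the test files as a red flag.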


bert hubert πŸ‡ΊπŸ‡¦πŸ‡ͺπŸ‡ΊπŸ‡ΊπŸ‡¦

@jmorris you are missing that other users do not care as much as you do about quality.


@jmorris It's been trained on a metric boatload of humans "fixing other things while I'm in here" but I've found it generally amenable to strict instructions not to deviate from its assigned task.

Breaking it up into small units of work just like a human should do anyway can help the stochastic parrot avoid scope creep by keeping the goals tight.

I've done things like "assuming the code is correct, fix all the failing tests" and then reviewed the changes to the tests. Yesterday I had a change that was much easier to make in code than to explain confidently to the model, but it left me facing dozens of tests with broken mocks and the brain-melting prospect of fixing all those trivially broken tests by hand. So I had it fix the tests, then reviewed all the changes as a form of indirect review of my own code. A test changing in an unexpected way would have sent me back with new ideas about what was wrong with the code. And in fact my first attempt did exactly that: it helped me very quickly recognize a missed assumption, fix it, and start over from scratch. Again I wrote the code by hand, let the robot adapt the tests, and then reviewed whether every test change made sense as an adaptation to my code change.

But also a good CLAUDE.md that instructs it to always color within the lines has been quite helpful in my experience for reducing this unhelpful behavior. I've also reinforced this in prompts, especially where I think it is likely to go astray.
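For reference, the "color within the lines" instructions in a CLAUDE.md can be quite short. The wording below is just illustrative, not any standard or recommended text:

```markdown
# Working agreements

- Do only the task you were given; do not fix unrelated issues you notice.
- Never remove or reword existing comments.
- Do not change tests unless the task explicitly says to.
- If work outside the assigned scope seems necessary, stop and ask first.
```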

@mcdanlj Thanks for the info! I'm expecting to need to provide a lot of structure & prompting, which is the real aim of the learning exercise for me here, but I was surprised it did some very random things that did not have any internal logical consistency. This is with Haiku, so perhaps it will be better with one of the current models. At some point I gather we will have an agent which also does the prompt development & managing of the coding AI?

@jmorris I currently have Claude set to choose its own model, and it uses multiple models, and sometimes it spins off sub-agents to do work; it is therefore actually acting as an agent which is prompting sub-agents.

For simple, obvious tasks, the cheaper models are sometimes better because they, in anthropomorphizing terms, can be "less imaginative." 🀣 But also sometimes they just aren't as good. Worktrees and branches and trying things a few different ways apply to choosing between LLMs just as they do to solving a problem a few different ways "by hand."

I do run it in a container with very limited access. I used https://github.com/gherlein/localdev as a starting point, but then practically rewrote it. While Greg Herlein uses it to make --dangerously-skip-permissions relatively safe, I use it because running something that is designed to act as if it has agency exceeds my trust for running outside of fairly strict isolation. ☺

Another tip I've learned: If the LLM starts going off the rails, don't re-prompt that session more than twice trying to redirect. Tell it to write a plan as a markdown file, then exit that session. Review and possibly edit the plan, and then start a new session. You can try other models on the same plan, or iterate on changing the plan. Even before the context window gets all the way full, the LLM "gets stupid" β€” I think of it kind of like a filesystem, where allocation tends to get hard long before the filesystem is actually 100% full.

Even without the LLM going off the rails, it's often a good idea to tell it to write a plan before doing anything, so you can review the plan and change it before it starts.
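The plan file doesn't need to be elaborate. Something like the sketch below (the structure and the task named in it are just my own illustration, not any required format) gives you a reviewable, editable artifact that survives across sessions and models:

```markdown
# Plan: migrate config parsing to TOML   <!-- hypothetical task -->

## Context
- Current state, relevant files, and constraints discovered so far.

## Steps
1. One small, verifiable change per step.
2. Which tests prove each step worked.

## Out of scope
- Things the session must NOT touch (comments, unrelated modules, test semantics).
```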

I hope some of that is useful...

@mcdanlj very useful, thanks. I think you may be describing the context window? I've seen it in other non-gen AI areas.

@jmorris Yes, exactly.

Also if the LLM tells you that it is compacting, that's another hint to have it write a plan and exit and start fresh. Again vaguely like a filesystem, where you often get higher performance if you can back up, wipe, and restore than just "defragmenting" β€” I don't know how strongly the analogy works but it's how I remember it... 🀣

And of course all this might be obsolete tomorrow. πŸ™„

@mcdanlj yep, I assume they will abstract all of this away somehow. BTW, it is doing some impressive things already. I said "you are a very senior developer and keen to impress with clean, bug-free code" and that seemed to have a big impact on its attitude. This is for controlling a radio via CI-V over serial, a messy undertaking, especially with a cross-platform UI and hardware in the mix. I told it to build a radio simulator that listens on a local UDP port for testing, and it saved me probably hours just on that.

@jmorris Heh! I've been thinking of trying to use Claude to write a touch screen control for my QMX and maybe KX3 to run on an ESP32 "cheap yellow display" so I can have exactly the controls I want. I like the idea of asking it to write a radio simulator for testing β€” I'll keep that in mind if I follow through on this project!

Good luck with the project. πŸ™‚

@mcdanlj tried a simpler project from scratch with a newer model. Took about an hour, then each new feature took 5-10 mins each. Saved many hours of work. https://github.com/xjamesmorris/pskreporter-tool