moffkalast 6 days ago

That's pretty cool, but I feel like the LLM integrations with ROS so far have largely missed the point in terms of useful applications. Endless examples of models sending bare-bones Twist commands do a disservice to what LLMs are actually good at; in terms of compute used, it's like swatting flies with a bazooka.

Getting the robot to move from point A to point B is largely a solved problem with traditional probabilistic methods, while the niches where LLMs really are the best fit remain, I think, mostly unaddressed, e.g.:

- a pipeline for natural language commands to high-level commands ("fetch me a beer" to [send nav2 goal to kitchen, get fridge detection from yolo, open fridge with moveit, detect beer with yolo, etc.]) - rough executor sketch after this list

- using a VLM to add semantic information to map areas, e.g. have the robot turn around 4 times in a room and have the model determine what's there, so it can reference things by location and even know where the kitchen and fridge from the above example are

- system monitoring, where an LLM looks at the output of ros2 doctor, htop, topic hz, etc. and determines whether something has crashed or is misbehaving, then returns a debug report or attempts a fix with terminal commands (diagnostics-gathering sketch after this list)

- handling recovery behaviours in general, since a lot of the time when robots get stuck the resolution is simple: you just need something that takes in the current situational information, reasons about it, and picks one of the possible ways to resolve it
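To make the pipeline idea concrete, here's a rough sketch of the executor side, assuming the LLM has already turned "fetch me a beer" into an ordered list of tool calls. The waypoint, the tool names, and the hard-coded plan are all made up for illustration; only the nav2 NavigateToPose action is the real interface:

    import rclpy
    from rclpy.action import ActionClient
    from rclpy.node import Node
    from geometry_msgs.msg import PoseStamped
    from nav2_msgs.action import NavigateToPose


    class FetchExecutor(Node):
        """Executes a tool-call plan produced elsewhere (e.g. by an LLM)."""

        def __init__(self):
            super().__init__('fetch_executor')
            self._nav = ActionClient(self, NavigateToPose, 'navigate_to_pose')

        def go_to(self, x, y):
            # Send a nav2 goal in the map frame and wait for it to finish.
            goal = NavigateToPose.Goal()
            goal.pose = PoseStamped()
            goal.pose.header.frame_id = 'map'
            goal.pose.pose.position.x = x
            goal.pose.pose.position.y = y
            goal.pose.pose.orientation.w = 1.0
            self._nav.wait_for_server()
            send_future = self._nav.send_goal_async(goal)
            rclpy.spin_until_future_complete(self, send_future)
            result_future = send_future.result().get_result_async()
            rclpy.spin_until_future_complete(self, result_future)

        def detect(self, label):
            # Stand-in for a detector call (e.g. a YOLO node behind a service).
            self.get_logger().info(f'pretending to detect: {label}')
            return True


    def main():
        rclpy.init()
        node = FetchExecutor()
        # What the LLM might hand back for "fetch me a beer" (hard-coded here).
        plan = [
            ('go_to', (3.0, 1.5)),       # made-up kitchen waypoint
            ('detect', ('fridge',)),
            ('detect', ('beer',)),
        ]
        for tool, args in plan:
            getattr(node, tool)(*args)
        node.destroy_node()
        rclpy.shutdown()


    if __name__ == '__main__':
        main()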
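And for the monitoring idea, the gist is just shelling out for diagnostics and letting the model read them. The commands and the prompt here are only one way to do it, and the actual model call is left as a comment since that depends on whatever client you use:

    import subprocess


    def capture(cmd, timeout=30):
        # Run a shell command and return whatever it printed, never raising.
        try:
            proc = subprocess.run(cmd, shell=True, capture_output=True,
                                  text=True, timeout=timeout)
            return f'$ {cmd}\n{proc.stdout}{proc.stderr}'
        except subprocess.TimeoutExpired:
            return f'$ {cmd}\n(timed out)'


    def main():
        report = '\n\n'.join([
            capture('ros2 doctor --report'),
            capture('ros2 node list'),
            capture('top -b -n 1 | head -n 15'),  # htop has no batch mode, top does
        ])
        prompt = ('You are monitoring a ROS 2 robot. From the diagnostics below, '
                  'report anything that looks crashed or misbehaving and suggest '
                  'a fix.\n\n' + report)
        # Hand `prompt` to whatever LLM client you use and act on the reply.
        print(prompt)


    if __name__ == '__main__':
        main()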

ponta17 5 days ago

Thanks a lot for the thoughtful feedback — I really appreciate it!

I think there might be a small misunderstanding regarding how the LLM is actually being used here (and in many agent-based setups). The LLM itself isn’t directly executing twist commands or handling motion; it’s acting as a decision-maker that chooses from a set of callable tools (Python functions) based on the task description and intermediate results.

In this case, yes — one of the tools happens to publish Twist commands, but that’s just one of many modular tools the LLM can invoke. Whether it’s controlling motion or running object detection, from the LLM’s point of view it’s simply choosing which function to call next. So the computational load really depends on what the tool does internally — not the LLM’s reasoning process itself.
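To make that concrete, the loop looks roughly like this. It's a simplified sketch rather than the project's actual code: the scripted choose_tool() stands in for the real LLM call, and the tool set is trimmed down to two examples.

    import rclpy
    from rclpy.node import Node
    from geometry_msgs.msg import Twist


    class ToolNode(Node):
        """Owns the ROS side; the LLM only ever picks which method to call."""

        def __init__(self):
            super().__init__('llm_tool_node')
            self._cmd_vel = self.create_publisher(Twist, 'cmd_vel', 10)

        def move(self, linear=0.0, angular=0.0):
            # Publishing Twist is just one tool among many.
            msg = Twist()
            msg.linear.x = float(linear)
            msg.angular.z = float(angular)
            self._cmd_vel.publish(msg)
            return 'published cmd_vel'

        def detect_objects(self):
            # Placeholder for an image-processing tool.
            return []


    # Canned plan standing in for the model's decisions, so the sketch runs as-is.
    _PLAN = [('move', {'linear': 0.2}), ('detect_objects', {}), ('move', {})]


    def choose_tool(task, available_tools, last_result):
        # In the real setup this is where the LLM reasons about `task` and
        # `last_result` and returns the next (tool_name, kwargs), or None to stop.
        return _PLAN.pop(0) if _PLAN else None


    def main():
        rclpy.init()
        node = ToolNode()
        tools = {'move': node.move, 'detect_objects': node.detect_objects}
        result = None
        while (choice := choose_tool('explore the room', list(tools), result)):
            name, kwargs = choice
            result = tools[name](**kwargs)  # the model never touches Twist itself
        node.destroy_node()
        rclpy.shutdown()


    if __name__ == '__main__':
        main()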

Of course, I agree with your broader point: we should push toward more meaningful high-level tasks where LLMs can orchestrate complex pipelines — and I think your examples (like fetch-a-beer or map annotation via VLMs) are spot-on.

My goal with this project was to explore that decision-making loop in a minimal, creative setting — kind of like a sandbox for LLM-agent behavior.

Actually, I’m currently working on something along those lines using a TurtleBot3. I’m planning to provide the agent with tools that let it scan obstacles via 3D LiDAR and recognize objects through image processing, so that it can make more context-aware decisions.
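As a very rough sketch of what those tools might look like (topic names, message types, and the 2D LaserScan here are simplifications for illustration; the real setup will use the 3D LiDAR and a proper detector):

    import rclpy
    from rclpy.node import Node
    from sensor_msgs.msg import Image, LaserScan


    class PerceptionTools(Node):
        """Caches the latest sensor data and exposes it as agent-callable tools."""

        def __init__(self):
            super().__init__('perception_tools')
            self._scan = None
            self._image = None
            self.create_subscription(LaserScan, 'scan', self._on_scan, 10)
            self.create_subscription(Image, 'camera/image_raw', self._on_image, 10)

        def _on_scan(self, msg):
            self._scan = msg

        def _on_image(self, msg):
            self._image = msg

        def scan_obstacles(self):
            # Summarise the scan into a short fact the LLM can reason about.
            if self._scan is None:
                return 'no scan received yet'
            valid = [r for r in self._scan.ranges
                     if self._scan.range_min < r < self._scan.range_max]
            if not valid:
                return 'no obstacles in range'
            return f'nearest obstacle at {min(valid):.2f} m'

        def recognize_objects(self):
            # Placeholder: feed self._image to whatever detector ends up being used.
            return 'detector not wired up in this sketch'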

Really appreciate the push for deeper use cases — that’s definitely where I want to go next!