[AI Dev Tools] Goose, Repository Documentation, and Bug Reproduction at Google
![[AI Dev Tools] Goose, Repository Documentation, and Bug Reproduction at Google](/content/images/size/w960/2025/02/8_documents.png)
Codename Goose: Extensible Development Agent for Code Operations
An open-source development agent that extends beyond basic code suggestions to handle installation, execution, editing, and testing of code.
Key Features:
- Goes beyond traditional code suggestions to provide comprehensive development assistance
- Supports full code operation lifecycle including installation, execution, editing, and testing
- Offers extensible architecture allowing customization and additional functionality
- Provides comprehensive documentation and installation guides for easy setup
Stars: 6129
LLM API Engine: Custom API Builder for Web Data Extraction
A tool for creating and deploying APIs that extract structured data from websites using natural language descriptions, powered by LLMs and web scraping technology.
Key Features:
- Create APIs by describing data needs in plain English, with automatic schema generation based on these descriptions
- Performs intelligent web scraping using Firecrawl technology with real-time data updates through scheduled scraping
- Provides structured data output with JSON Schema validation and Redis-powered caching
- Supports flexible deployment options including Cloudflare Workers, Vercel Edge Functions, and AWS Lambda
- Built with Next.js 14, React 18, and integrates with Upstash Redis for data storage
- Example use: Create an API to extract company information (name, revenue, employee count) by providing target URLs and scheduling automatic updates
Stars: 632
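To make the workflow concrete, here is a minimal sketch of how such a deployment might be driven from Python. The route names and payload shapes are illustrative assumptions, not the project's documented API; consult the repository README for the real contract.

```python
import requests

# Hypothetical base URL and routes -- the real deployment contract may differ.
BASE = "https://your-deployment.example.com"

# 1. Describe the data in plain English; the engine derives a JSON Schema.
schema = requests.post(f"{BASE}/api/generate-schema", json={
    "description": "Company name, annual revenue, and employee count",
}).json()

# 2. Register the extraction job with target URLs and a refresh schedule.
requests.post(f"{BASE}/api/deploy", json={
    "schema": schema,
    "urls": ["https://example.com/about"],
    "schedule": "0 6 * * *",  # re-scrape daily at 06:00
})

# 3. Consumers read validated, Redis-cached results as JSON.
print(requests.get(f"{BASE}/api/results/company-info").json())
```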
RepoAgent: LLM-Powered Code Documentation Generator
RepoAgent is a framework that automates repository-level code documentation generation by analyzing code structure, tracking changes, and leveraging LLMs to create comprehensive documentation for Python projects.
Key Features:
- Analyzes code structure through Abstract Syntax Trees (AST) to generate detailed documentation for individual code objects
- Tracks Git repository changes automatically and updates documentation accordingly through pre-commit hooks
- Maps bidirectional relationships between code objects to provide comprehensive context
- Generates documentation in multiple threads for improved performance
- Creates and maintains documentation in a GitBook format
- Installation options include pip, GitHub Actions, or PDM for development
- Provides a chat interface for Q&A about the repository and code explanations
- Example use: Generated documentation for the XAgent project (270,000 lines of code) using GPT-3.5-turbo
Stars: 502
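The AST pass at the heart of this approach is easy to picture. Below is a minimal sketch (ours, not RepoAgent's code) that finds documentable objects in a Python file and hands each one to a stubbed LLM call:

```python
import ast

def call_llm(prompt: str) -> str:
    """Stub for the model call (RepoAgent used GPT-3.5-turbo for XAgent)."""
    return f"[generated docs for a {len(prompt)}-char prompt]"

def code_objects(source: str):
    """Yield (name, source) for each function or class found via the AST,
    mirroring the pass that locates individual documentable objects."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            yield node.name, ast.get_source_segment(source, node)

source = "def add(a, b):\n    return a + b\n"
for name, snippet in code_objects(source):
    print(name, "->", call_llm(f"Document `{name}`:\n{snippet}"))
```

RepoAgent additionally feeds each object's callers and callees into the prompt, which is where the bidirectional relationship mapping pays off.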
LLM-Based Generation of Serverless Functions: An Empirical Study
A study investigating LLMs' ability to generate complete serverless functions, evaluating their performance using open-source repositories and various context levels.
- The focus on serverless functions (FaaS) stems from their smaller architectural footprint compared to monoliths and microservices.
- Methodology involved masking existing serverless functions and having LLMs regenerate them with varying levels of system context.
- Evaluation combined existing repository tests for correctness, software engineering metrics for code quality, and NLP metrics to compare similarity with human-written code.
- Research aims to bridge the gap between design decisions and deployment by exploring automated generation of complete architectural components.
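A rough sketch of the masking setup, under our own simplifying assumptions (the paper's actual harness is more involved): strip a handler's body, then regenerate it with increasing amounts of context and run the repository's existing tests on the result.

```python
def mask_function(source: str, placeholder: str = "    ...  # MASKED\n") -> str:
    """Keep the signature, drop the implementation."""
    header, _, _ = source.partition("\n")
    return header + "\n" + placeholder

handler = '''def handler(event, context):
    user = event["user_id"]
    return {"statusCode": 200, "body": user}
'''

context_levels = {
    "none": "",
    "file": "# other functions from the same file ...",
    "system": "# interfaces of upstream and downstream services ...",
}

for level, ctx in context_levels.items():
    prompt = f"{ctx}\nComplete this serverless function:\n{mask_function(handler)}"
    # response = call_llm(prompt)  # hypothetical; then run the repo's tests
```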
Energy and Performance Analysis of LLM-Generated Code
A study analyzing the energy efficiency and performance of code generated by LLMs (GitHub Copilot, GPT-4, and OpenAI o1-mini) across Python, Java, and C++.
- Evaluation covers LeetCode programming problems, tested on both Mac and PC platforms
- Performance metrics focus on energy consumption and execution efficiency rather than just code correctness
- Results indicate higher success rates in generating Python and Java code compared to C++
Source: AI-Powered, But Power-Hungry? Energy Efficiency of LLM-Generated Code
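Reproducing the execution-efficiency half of such a comparison is straightforward; the energy half is platform-specific. A sketch under our own assumptions:

```python
import time

def avg_runtime(fn, *args, repeats: int = 100) -> float:
    """Average wall-clock seconds per call. Energy measurement needs
    platform hooks instead: Intel RAPL counters on Linux
    (/sys/class/powercap/intel-rapl) or powermetrics on macOS."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats

# Compare two LLM-generated solutions to the same LeetCode-style task.
two_sum = lambda nums, t: next((i, j) for i in range(len(nums))
                               for j in range(i + 1, len(nums))
                               if nums[i] + nums[j] == t)

print(f"avg seconds per call: {avg_runtime(two_sum, [2, 7, 11, 15], 9):.2e}")
```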
Code Maintainability Fixes: LLM Performance Analysis
A study evaluating how effectively LLMs can fix code maintainability issues across 127 cases from GitHub repositories, comparing Copilot Chat and Llama 3.1 performance.
- Few-shot prompting with Llama achieved the highest success rate at 44.9%, while Copilot Chat and Llama with zero-shot prompting reached 32.29% and 30% respectively.
- Most generated solutions introduced new errors or maintainability issues, despite fixing the original problems.
- A human evaluation with 45 participants reviewing 51 LLM-generated solutions found improved code readability in 68.63% of cases.
Source: Evaluating the Effectiveness of LLMs in Fixing Maintainability Issues in Real-World Projects
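Since few-shot prompting was the best-performing setup, here is a minimal sketch of what that looks like in practice. The prompt wording and the example smell are ours, not the paper's:

```python
# Solved before/after pairs for the same maintainability smell.
FEW_SHOT_EXAMPLES = [{
    "issue": "Method too long",
    "before": "def process(o):\n    # 80 lines of validation, pricing, IO ...",
    "after": "def process(o):\n    validate(o)\n    price(o)\n    persist(o)",
}]

def build_prompt(issue: str, code: str) -> str:
    shots = "\n\n".join(
        f"Issue: {ex['issue']}\nBefore:\n{ex['before']}\nAfter:\n{ex['after']}"
        for ex in FEW_SHOT_EXAMPLES
    )
    return (f"{shots}\n\nIssue: {issue}\nBefore:\n{code}\n"
            "After:\n# Fix the issue without changing behavior.")

print(build_prompt("Method too long", "def run():\n    ..."))
```

The study's other finding is worth keeping in mind when using a loop like this: re-run the maintainability checker on the model's output, since fixes frequently introduced new issues.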
LLMSecConfig: Automated Container Security Misconfiguration Repair
A framework that combines Static Analysis Tools with LLMs to automatically fix security misconfigurations in container orchestrators while maintaining operational functionality.
- Static Analysis Tools effectively detect security vulnerabilities but lack automated repair capabilities
- The solution uses advanced prompting techniques and Retrieval-Augmented Generation to fix detected issues
- Testing on 1,000 real-world Kubernetes configurations achieved 94% success rate with minimal introduction of new misconfigurations
Source: LLMSecConfig: An LLM-Based Approach for Fixing Software Container Misconfigurations
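The detect-repair-recheck loop is simple to sketch. Everything below is our illustration of the pattern, not the paper's implementation; in practice the detector would be a real static analysis tool such as Checkov or KubeLinter.

```python
import yaml  # pip install pyyaml

def static_check(manifest: dict) -> list[str]:
    """Toy stand-in for a real SAT: flag privileged containers."""
    return [f"{c['name']}: privileged container"
            for c in manifest["spec"]["containers"]
            if c.get("securityContext", {}).get("privileged")]

manifest = yaml.safe_load("""
spec:
  containers:
  - name: web
    image: nginx:1.25
    securityContext: {privileged: true}
""")

for issue in static_check(manifest):
    prompt = (f"Fix this Kubernetes misconfiguration without changing "
              f"behavior:\n{issue}\n\n{yaml.dump(manifest)}")
    # patched = yaml.safe_load(call_llm(prompt))  # hypothetical model call
    # assert not static_check(patched)            # re-run SAT to confirm
```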
SE Arena: Software Engineering Chatbot Evaluation Platform
A benchmarking platform that evaluates software engineering chatbots through interactive, multi-round conversations with repository context integration.
- The platform supports iterative conversations and end-to-end model comparisons through an open-source leaderboard
- RepoChat feature enhances evaluation by incorporating repository context like issues, commits, and pull requests into conversations
- Evaluation framework specifically designed to assess model performance in context-rich software engineering tasks like code generation, debugging, and requirement refinement
Source: SE Arena: Benchmarking Software Engineering Chatbots with Iterative Interactions
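The iterative part of the setup is the interesting bit: each round feeds the model's previous answer back in, with repository context prepended RepoChat-style. A toy rendering under our own assumptions:

```python
def evaluate(model, repo_context: str, turns: list[str]) -> list[str]:
    """Run a multi-round conversation; `model` is any prompt -> reply callable."""
    history, answers = repo_context, []
    for turn in turns:
        history += f"\nUser: {turn}"
        reply = model(history)
        history += f"\nAssistant: {reply}"
        answers.append(reply)
    return answers

echo = lambda prompt: f"(saw {len(prompt)} chars of context)"
print(evaluate(echo, "Issue #42: NPE in parser ...",
               ["Diagnose the issue", "Now write a fix"]))
```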
IoT-Together: LLM-Enhanced Dynamic IoT Systems with Mixed-Initiative Interaction
A framework that integrates LLMs into IoT systems to enable dynamic service generation and intelligent goal interpretation through user-system collaboration.
- Mixed-Initiative Interaction enables users and IoT systems to work together in creating adaptive solutions aligned with user goals.
- The architecture features a multi-pass dialogue framework for interpreting user needs and generating appropriate services at runtime.
- Implementation in a smart city tourism scenario demonstrated efficient service identification and high adaptation quality through agent-based simulation and user studies.
Source: Leveraging LLMs for Dynamic IoT Systems Generation through Mixed-Initiative Interaction
LLM-Assisted Refinement of Multidimensional Data Cube Design
A study evaluating how LLMs can assist end-users in refining conceptual schemata for multidimensional data cubes, focusing on ChatGPT's GPT-4 model and the Dimensional Fact Model formalism.
- The research focused on automating the refinement process, which typically requires manual collaboration between designers and end-users for tasks like attribute labeling and removal of uninteresting attributes.
- Three research questions explored ChatGPT's competencies in multidimensional modeling, refinement capabilities, and potential improvements through prompt engineering.
- Results demonstrated that careful prompt engineering significantly improved refinement accuracy, with remaining errors addressable through additional prompting.
- Despite improvements, designer oversight remains necessary to ensure the validity of refined schemata.
Source: Using ChatGPT to refine draft conceptual schemata in supply-driven design of multidimensional cubes
LLM Code Generation Security Analysis: Multi-Language Study
A comprehensive analysis of security and quality in code generated by LLMs across multiple programming languages, based on 200 diverse coding tasks.
- Security effectiveness varies significantly across different programming languages, with notable weaknesses in implementing modern security features.
- Generated code often lacks integration with recent security updates, such as Java 17 features and modern C++ practices.
- The evaluation framework includes 200 tasks across six categories, measuring both security implementation and code maintainability.
Source: Security and Quality in LLM-Generated Code: A Multi-Language, Multi-Model Analysis
BRT Agent: Automated Bug Reproduction Test Generation at Google
A system that automatically generates Bug Reproduction Tests (BRTs) from bug reports, designed to work with Google's large-scale, proprietary codebase.
- Bug Reproduction Tests, which fail when a bug is present and pass when fixed, are crucial for debugging but rarely included in bug reports.
- The LLM-based approach achieves a 28% plausible BRT generation rate on 80 human-reported bugs from Google's internal issue tracker, compared to 10% from the previous LIBRO system.
- Integration with Google's Automated Program Repair (APR) system resulted in 30% more bugs receiving plausible fixes when using the generated BRTs.
- The new Ensemble Pass Rate metric helps select promising fixes, achieving 70% accuracy in selecting plausible fixes from pools of 20 candidates.
Source: Agentic Bug Reproduction for Effective Automated Program Repair at Google
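Two small sketches make the mechanics concrete (ours, not Google's code). A BRT is just a test wired to the reported failure, and the selection metric scores a candidate fix by how many generated BRTs it passes:

```python
def parse(line: str) -> list[str]:
    # Buggy implementation: indexing an empty string raises IndexError.
    return [] if line[0] == "#" else line.split(",")

def brt_parse_empty_line() -> bool:
    """BRT for the report 'parse crashes on empty lines':
    fails while the bug exists, passes once it is fixed."""
    try:
        return parse("") == []
    except Exception:
        return False

def ensemble_pass_rate(brt_results: list[bool]) -> float:
    """Fraction of an ensemble of generated BRTs a candidate fix passes;
    higher rates mark the more plausible fixes in a candidate pool."""
    return sum(brt_results) / len(brt_results)

print(brt_parse_empty_line())                          # False until fixed
print(ensemble_pass_rate([True, True, False, True]))   # 0.75
```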
LLM-Based Java Verification: A Specification Generation Approach
A research paper exploring how LLMs can generate annotation-based code specifications for Java verification, with built-in verification tools to validate the generated specifications.
- LLMs demonstrate capability in generating code specifications alongside their code generation abilities
- Deductive verification provides tools to validate LLM-generated specifications, ensuring reliability despite the inherent uncertainty of LLM outputs
- The approach shows potential for scaling to verify large software systems with provable correctness guarantees
Source: Next Steps in LLM-Supported Java Verification
Student-LLM Interaction Study: Software Engineering Education Analysis
A study analyzing how 126 undergraduate students interacted with AI assistants during a 13-week software engineering course, examining conversations, code generation, and integration patterns.
- Students showed a preference for ChatGPT over Copilot, with ChatGPT generating computationally less complex code.
- Conversational interactions with LLMs produced higher-quality code than auto-generated solutions.
- Analysis covered student conversations, generated code, code utilization rates, and the level of human intervention needed for codebase integration.
Source: Analysis of Student-LLM Interaction in a Software Engineering Project
LLMs vs Human Experts in Requirements Engineering: A Comparative Study
A study comparing LLMs and human experts in requirements elicitation reveals significant advantages in speed, cost, and quality of LLM-generated requirements.
- LLM-generated requirements showed higher alignment (+1.12) and better completeness (+10.2%) compared to human-generated requirements
- Performance metrics showed LLMs working 720 times faster than human experts, at only 0.06% of the cost
- Users displayed a bias towards human authorship, attributing better-aligned solutions to human experts despite contrary evidence
Source: Analysis of LLMs vs Human Experts in Requirements Engineering
LLM Multi-Agent System: Autonomous Legacy Web Application Upgrades
A system that autonomously upgrades legacy web applications using multiple LLM agents working in coordination, distributing tasks across different upgrade phases.
- Zero-shot and one-shot prompts were used to evaluate the system's effectiveness in updating view files and meeting complex requirements.
- Multiple agents maintain context across tasks, improving solution quality compared to standalone LLM execution in certain scenarios.
- Results showed high precision in updating small outdated files, even with basic prompts.
- Source code is available at: https://github.com/alasalm1/Multi-agent-pipeline
Source: Autonomous Legacy Web Application Upgrades Using a Multi-Agent System
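The coordination pattern is roughly a pipeline with shared state. A minimal sketch, with roles and phases assumed by us rather than taken from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class UpgradeContext:
    files: dict[str, str]
    notes: list[str] = field(default_factory=list)

def analyzer(ctx: UpgradeContext) -> UpgradeContext:
    ctx.notes.append("index.html uses deprecated template syntax")
    return ctx

def updater(ctx: UpgradeContext) -> UpgradeContext:
    # An LLM call would rewrite each flagged file here; ctx.notes carries
    # the cross-task context credited for better solution quality.
    ctx.files["index.html"] = "<!-- upgraded markup -->"
    return ctx

def verifier(ctx: UpgradeContext) -> UpgradeContext:
    ctx.notes.append("all updated files parse cleanly")
    return ctx

ctx = UpgradeContext(files={"index.html": "<!-- legacy markup -->"})
for agent in (analyzer, updater, verifier):
    ctx = agent(ctx)
print(ctx.notes)
```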
CASEY: Security Vulnerability Triage Using LLMs
A framework that automates the identification of Common Weakness Enumerations (CWEs) and severity assessment of security bugs using LLMs.
- The system uses prompt engineering and contextual information at multiple granularity levels to analyze security vulnerabilities.
- Performance testing on the National Vulnerability Database showed 68% accuracy for CWE identification and 73.6% for severity assessment.
- Combined accuracy for identifying both CWE and severity levels reached 51.2%, demonstrating potential for streamlining vulnerability management workflows.
Source: Streamlining Security Vulnerability Triage with Large Language Models
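A sketch of the multi-granularity prompting idea; the prompt text and field names are our assumptions, not CASEY's:

```python
def triage_prompt(bug_report: str, function_src: str = "",
                  file_summary: str = "") -> str:
    """Widen context step by step: report only, plus function, plus file."""
    context = "\n\n".join(filter(None, [
        f"Bug report:\n{bug_report}",
        f"Affected function:\n{function_src}" if function_src else "",
        f"File summary:\n{file_summary}" if file_summary else "",
    ]))
    return (f"{context}\n\nIdentify the CWE ID and a severity rating "
            "(LOW/MEDIUM/HIGH/CRITICAL). Answer as JSON: "
            '{"cwe": "...", "severity": "..."}')

print(triage_prompt("strcpy into a fixed-size buffer from network input"))
```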
RepoAudit: Autonomous Repository-Level Code Auditing Using LLM Agents
An LLM-powered autonomous agent system that performs efficient and precise code auditing at the repository level, analyzing data-flow and program paths to identify bugs.
- Built-in agent memory allows exploration of code repositories on demand, focusing on data-flow analysis along feasible program paths in functions.
- A validator component mitigates hallucinations by verifying data-flow facts and checking path conditions of potential bugs to reduce false positives.
- Tests with Claude 3.5 Sonnet discovered 38 true bugs across 15 real-world systems, with an average processing time of 0.44 hours and cost of $2.54 per project.
Source: RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing
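The validator is the part that keeps hallucinations out of the report. A toy rendering of the idea (ours, not RepoAudit's code): re-check that every data-flow step the model claims actually exists in the source before accepting the bug.

```python
import ast

def assignments(source: str) -> set[tuple[str, str]]:
    """Collect (target, origin) pairs for simple name-to-name assignments."""
    pairs = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Name):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    pairs.add((target.id, node.value.id))
    return pairs

def validate_flow(claimed: list[tuple[str, str]], source: str) -> bool:
    """Accept a reported flow only if every claimed step is real."""
    return all(step in assignments(source) for step in claimed)

src = "tainted = user_input\nquery = tainted\n"
print(validate_flow([("tainted", "user_input"), ("query", "tainted")], src))  # True
print(validate_flow([("query", "sanitized")], src))                           # False
```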
AugmenTest: Test Oracle Generation Using LLMs and Documentation
A framework that generates test oracles by inferring correct behavior from software documentation using LLMs, addressing a key challenge in automated test generation.
- The system interprets method behavior from documentation and developer comments, rather than analyzing code directly
- Four variants are available: Simple Prompt, Extended Prompt, RAG with generic prompt, and RAG with Simple Prompt, each providing different levels of contextual information
- Evaluation on 142 Java classes showed the Extended Prompt variant achieved a 30% success rate in generating correct assertions, significantly outperforming the current state-of-the-art TOGA's 8.2%
- RAG-based approaches performed below expectations with an 18.2% success rate in the most conservative testing scenario
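To picture what oracle inference from documentation looks like, here is a small sketch with our own prompt wording; the model sees the documented contract rather than the implementation:

```python
def oracle_prompt(doc: str, signature: str, test_prefix: str) -> str:
    return (f"Method: {signature}\n"
            f"Documentation: {doc}\n"
            f"Test so far:\n{test_prefix}\n"
            "Complete the test with a single assertion that checks the "
            "documented behavior.")

print(oracle_prompt(
    doc="Returns the absolute value; the result is never negative.",
    signature="int abs(int x)",
    test_prefix="int result = abs(-5);",
))
```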