[AI Dev Tools] Goose, Repository Documentation, and Bug Reproduction at Google
![[AI Dev Tools] Goose, Repository Documentation, and Bug Reproduction at Google](/content/images/size/w960/2025/02/8_documents.png)
Codename Goose: Extensible Development Agent for Code Operations
An open-source development agent that extends beyond basic code suggestions to handle installation, execution, editing, and testing of code.
Key Features:
- Goes beyond traditional code suggestions to provide comprehensive development assistance
- Supports full code operation lifecycle including installation, execution, editing, and testing
- Offers extensible architecture allowing customization and additional functionality
- Provides comprehensive documentation and installation guides for easy setup
Stars: 6129
LLM API Engine: Custom API Builder for Web Data Extraction
A tool for creating and deploying APIs that extract structured data from websites using natural language descriptions, powered by LLMs and web scraping technology.
Key Features:
- Create APIs by describing data needs in plain English, with automatic schema generation based on these descriptions
- Performs intelligent web scraping using Firecrawl technology with real-time data updates through scheduled scraping
- Provides structured data output with JSON Schema validation and Redis-powered caching
- Supports flexible deployment options including Cloudflare Workers, Vercel Edge Functions, and AWS Lambda
- Built with Next.js 14, React 18, and integrates with Upstash Redis for data storage
- Example use: Create an API to extract company information (name, revenue, employee count) by providing target URLs and scheduling automatic updates
Stars: 632
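To make the workflow concrete, here is a minimal sketch of how such a deployment might be driven from Python. The route names and payload shapes are illustrative assumptions, not the project's documented API; consult the repository README for the real contract.

```python
import requests

# Hypothetical base URL and routes -- the real deployment contract may differ.
BASE = "https://your-deployment.example.com"

# 1. Describe the data in plain English; the engine derives a JSON Schema.
schema = requests.post(f"{BASE}/api/generate-schema", json={
    "description": "Company name, annual revenue, and employee count",
}).json()

# 2. Register the extraction job with target URLs and a refresh schedule.
requests.post(f"{BASE}/api/deploy", json={
    "schema": schema,
    "urls": ["https://example.com/about"],
    "schedule": "0 6 * * *",  # re-scrape daily at 06:00
})

# 3. Consumers read validated, Redis-cached results as JSON.
print(requests.get(f"{BASE}/api/results/company-info").json())
```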
RepoAgent: LLM-Powered Code Documentation Generator
RepoAgent is a framework that automates repository-level code documentation generation by analyzing code structure, tracking changes, and leveraging LLMs to create comprehensive documentation for Python projects.
Key Features:
- Analyzes code structure through Abstract Syntax Trees (AST) to generate detailed documentation for individual code objects
- Tracks Git repository changes automatically and updates documentation accordingly through pre-commit hooks
- Maps bidirectional relationships between code objects to provide comprehensive context
- Generates documentation in multiple threads for improved performance
- Creates and maintains documentation in a GitBook format
- Installation options include pip, GitHub Actions, or PDM for development
- Provides a chat interface for Q&A about the repository and code explanations
- Example use: Generated documentation for the XAgent project (270,000 lines of code) using GPT-3.5-turbo
Stars: 502
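The AST pass at the heart of this approach is easy to picture. Below is a minimal sketch (ours, not RepoAgent's code) that finds documentable objects in a Python file and hands each one to a stubbed LLM call:

```python
import ast

def call_llm(prompt: str) -> str:
    """Stub for the model call (RepoAgent used GPT-3.5-turbo for XAgent)."""
    return f"[generated docs for a {len(prompt)}-char prompt]"

def code_objects(source: str):
    """Yield (name, source) for each function or class found via the AST,
    mirroring the pass that locates individual documentable objects."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            yield node.name, ast.get_source_segment(source, node)

source = "def add(a, b):\n    return a + b\n"
for name, snippet in code_objects(source):
    print(name, "->", call_llm(f"Document `{name}`:\n{snippet}"))
```

RepoAgent additionally feeds each object's callers and callees into the prompt, which is where the bidirectional relationship mapping pays off.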
LLM-Based Generation of Serverless Functions: An Empirical Study
A study investigating LLMs' ability to generate complete serverless functions, evaluating their performance using open-source repositories and various context levels.
- The focus on serverless functions (FaaS) stems from their smaller architectural footprint compared to monoliths and microservices.
- Methodology involved masking existing serverless functions and having LLMs regenerate them with varying levels of system context.
- Evaluation combined existing repository tests for correctness, software engineering metrics for code quality, and NLP metrics to compare similarity with human-written code.
- Research aims to bridge the gap between design decisions and deployment by exploring automated generation of complete architectural components.
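A rough sketch of the masking setup, under our own simplifying assumptions (the paper's actual harness is more involved): strip a handler's body, then regenerate it with increasing amounts of context and run the repository's existing tests on the result.

```python
def mask_function(source: str, placeholder: str = "    ...  # MASKED\n") -> str:
    """Keep the signature, drop the implementation."""
    header, _, _ = source.partition("\n")
    return header + "\n" + placeholder

handler = '''def handler(event, context):
    user = event["user_id"]
    return {"statusCode": 200, "body": user}
'''

context_levels = {
    "none": "",
    "file": "# other functions from the same file ...",
    "system": "# interfaces of upstream and downstream services ...",
}

for level, ctx in context_levels.items():
    prompt = f"{ctx}\nComplete this serverless function:\n{mask_function(handler)}"
    # response = call_llm(prompt)  # hypothetical; then run the repo's tests
```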
Energy and Performance Analysis of LLM-Generated Code
A study analyzing the energy efficiency and performance of code generated by LLMs (GitHub Copilot, GPT-4, and OpenAI o1-mini) across Python, Java, and C++.
- Evaluation covers LeetCode programming problems, tested on both Mac and PC platforms
- Performance metrics focus on energy consumption and execution efficiency rather than just code correctness
- Results indicate higher success rates in generating Python and Java code compared to C++
Source: AI-Powered, But Power-Hungry? Energy Efficiency of LLM-Generated Code
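Reproducing the execution-efficiency half of such a comparison is straightforward; the energy half is platform-specific. A sketch under our own assumptions:

```python
import time

def avg_runtime(fn, *args, repeats: int = 100) -> float:
    """Average wall-clock seconds per call. Energy measurement needs
    platform hooks instead: Intel RAPL counters on Linux
    (/sys/class/powercap/intel-rapl) or powermetrics on macOS."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats

# Compare two LLM-generated solutions to the same LeetCode-style task.
two_sum = lambda nums, t: next((i, j) for i in range(len(nums))
                               for j in range(i + 1, len(nums))
                               if nums[i] + nums[j] == t)

print(f"avg seconds per call: {avg_runtime(two_sum, [2, 7, 11, 15], 9):.2e}")
```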
Code Maintainability Fixes: LLM Performance Analysis
A study evaluating how effectively LLMs can fix code maintainability issues across 127 cases from GitHub repositories, comparing Copilot Chat and Llama 3.1 performance.
- Few-shot prompting with Llama achieved the highest success rate at 44.9%, while Copilot Chat and Llama with zero-shot prompting reached 32.29% and 30% respectively.
- Most generated solutions introduced new errors or maintainability issues, despite fixing the original problems.
- A human evaluation with 45 participants reviewing 51 LLM-generated solutions found improved code readability in 68.63% of cases.
Source: Evaluating the Effectiveness of LLMs in Fixing Maintainability Issues in Real-World Projects
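Since few-shot prompting was the best-performing setup, here is a minimal sketch of what that looks like in practice. The prompt wording and the example smell are ours, not the paper's:

```python
# Solved before/after pairs for the same maintainability smell.
FEW_SHOT_EXAMPLES = [{
    "issue": "Method too long",
    "before": "def process(o):\n    # 80 lines of validation, pricing, IO ...",
    "after": "def process(o):\n    validate(o)\n    price(o)\n    persist(o)",
}]

def build_prompt(issue: str, code: str) -> str:
    shots = "\n\n".join(
        f"Issue: {ex['issue']}\nBefore:\n{ex['before']}\nAfter:\n{ex['after']}"
        for ex in FEW_SHOT_EXAMPLES
    )
    return (f"{shots}\n\nIssue: {issue}\nBefore:\n{code}\n"
            "After:\n# Fix the issue without changing behavior.")

print(build_prompt("Method too long", "def run():\n    ..."))
```

The study's other finding is worth keeping in mind when using a loop like this: re-run the maintainability checker on the model's output, since fixes frequently introduced new issues.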
LLMSecConfig: Automated Container Security Misconfiguration Repair
A framework that combines Static Analysis Tools with LLMs to automatically fix security misconfigurations in container orchestrators while maintaining operational functionality.
- Static Analysis Tools effectively detect security vulnerabilities but lack automated repair capabilities
- The solution uses advanced prompting techniques and Retrieval-Augmented Generation to fix detected issues
- Testing on 1,000 real-world Kubernetes configurations achieved 94% success rate with minimal introduction of new misconfigurations
Source: LLMSecConfig: An LLM-Based Approach for Fixing Software Container Misconfigurations
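The detect-repair-recheck loop is simple to sketch. Everything below is our illustration of the pattern, not the paper's implementation; in practice the detector would be a real static analysis tool such as Checkov or KubeLinter.

```python
import yaml  # pip install pyyaml

def static_check(manifest: dict) -> list[str]:
    """Toy stand-in for a real SAT: flag privileged containers."""
    return [f"{c['name']}: privileged container"
            for c in manifest["spec"]["containers"]
            if c.get("securityContext", {}).get("privileged")]

manifest = yaml.safe_load("""
spec:
  containers:
  - name: web
    image: nginx:1.25
    securityContext: {privileged: true}
""")

for issue in static_check(manifest):
    prompt = (f"Fix this Kubernetes misconfiguration without changing "
              f"behavior:\n{issue}\n\n{yaml.dump(manifest)}")
    # patched = yaml.safe_load(call_llm(prompt))  # hypothetical model call
    # assert not static_check(patched)            # re-run SAT to confirm
```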
SE Arena: Software Engineering Chatbot Evaluation Platform
A benchmarking platform that evaluates software engineering chatbots through interactive, multi-round conversations with repository context integration.
- The platform supports iterative conversations and end-to-end model comparisons through an open-source leaderboard
- RepoChat feature enhances evaluation by incorporating repository context like issues, commits, and pull requests into conversations
- Evaluation framework specifically designed to assess model performance in context-rich software engineering tasks like code generation, debugging, and requirement refinement
Source: SE Arena: Benchmarking Software Engineering Chatbots with Iterative Interactions
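The iterative part of the setup is the interesting bit: each round feeds the model's previous answer back in, with repository context prepended RepoChat-style. A toy rendering under our own assumptions:

```python
def evaluate(model, repo_context: str, turns: list[str]) -> list[str]:
    """Run a multi-round conversation; `model` is any prompt -> reply callable."""
    history, answers = repo_context, []
    for turn in turns:
        history += f"\nUser: {turn}"
        reply = model(history)
        history += f"\nAssistant: {reply}"
        answers.append(reply)
    return answers

echo = lambda prompt: f"(saw {len(prompt)} chars of context)"
print(evaluate(echo, "Issue #42: NPE in parser ...",
               ["Diagnose the issue", "Now write a fix"]))
```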
IoT-Together: LLM-Enhanced Dynamic IoT Systems with Mixed-Initiative Interaction
A framework that integrates LLMs into IoT systems to enable dynamic service generation and intelligent goal interpretation through user-system collaboration.
- Mixed-Initiative Interaction enables users and IoT systems to work together in creating adaptive solutions aligned with user goals.
- The architecture features a multi-pass dialogue framework for interpreting user needs and generating appropriate services at runtime.
- Implementation in a smart city tourism scenario demonstrated efficient service identification and high adaptation quality through agent-based simulation and user studies.
Source: Leveraging LLMs for Dynamic IoT Systems Generation through Mixed-Initiative Interaction
LLM-Assisted Refinement of Multidimensional Data Cube Design
A study evaluating how LLMs can assist end-users in refining conceptual schemata for multidimensional data cubes, focusing on ChatGPT's GPT-4 model and the Dimensional Fact Model formalism.
- The research focused on automating the refinement process, which typically requires manual collaboration between designers and end-users for tasks like attribute labeling and removal of uninteresting attributes.
- Three research questions explored ChatGPT's competencies in multidimensional modeling, refinement capabilities, and potential improvements through prompt engineering.
- Results demonstrated that careful prompt engineering significantly improved refinement accuracy, with remaining errors addressable through additional prompting.
- Despite improvements, designer oversight remains necessary to ensure the validity of refined schemata.
Source: Using ChatGPT to refine draft conceptual schemata in supply-driven design of multidimensional cubes
LLM Code Generation Security Analysis: Multi-Language Study
A comprehensive analysis of security and quality in code generated by LLMs across multiple programming languages, based on 200 diverse coding tasks.
- Security effectiveness varies significantly across different programming languages, with notable weaknesses in implementing modern security features.
- Generated code often lacks integration with recent security updates, such as Java 17 features and modern C++ practices.
- The evaluation framework includes 200 tasks across six categories, measuring both security implementation and code maintainability.
Source: Security and Quality in LLM-Generated Code: A Multi-Language, Multi-Model Analysis
BRT Agent: Automated Bug Reproduction Test Generation at Google
A system that automatically generates Bug Reproduction Tests (BRTs) from bug reports, designed to work with Google's large-scale, proprietary codebase.
- Bug Reproduction Tests, which fail when a bug is present and pass when fixed, are crucial for debugging but rarely included in bug reports.
- The LLM-based approach achieves a 28% plausible BRT generation rate on 80 human-reported bugs from Google's internal issue tracker, compared to 10% from the previous LIBRO system.
- Integration with Google's Automated Program Repair (APR) system resulted in 30% more bugs receiving plausible fixes when using the generated BRTs.
- The new Ensemble Pass Rate metric helps select promising fixes, achieving 70% accuracy in selecting plausible fixes from pools of 20 candidates.
Source: Agentic Bug Reproduction for Effective Automated Program Repair at Google
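Two small sketches make the mechanics concrete (ours, not Google's code). A BRT is just a test wired to the reported failure, and the selection metric scores a candidate fix by how many generated BRTs it passes:

```python
def parse(line: str) -> list[str]:
    # Buggy implementation: indexing an empty string raises IndexError.
    return [] if line[0] == "#" else line.split(",")

def brt_parse_empty_line() -> bool:
    """BRT for the report 'parse crashes on empty lines':
    fails while the bug exists, passes once it is fixed."""
    try:
        return parse("") == []
    except Exception:
        return False

def ensemble_pass_rate(brt_results: list[bool]) -> float:
    """Fraction of an ensemble of generated BRTs a candidate fix passes;
    higher rates mark the more plausible fixes in a candidate pool."""
    return sum(brt_results) / len(brt_results)

print(brt_parse_empty_line())                          # False until fixed
print(ensemble_pass_rate([True, True, False, True]))   # 0.75
```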
LLM-Based Java Verification: A Specification Generation Approach
A research paper exploring how LLMs can generate annotation-based code specifications for Java verification, with built-in verification tools to validate the generated specifications.
- LLMs demonstrate capability in generating code specifications alongside their code generation abilities
- Deductive verification provides tools to validate LLM-generated specifications, ensuring reliability despite the inherent uncertainty of LLM outputs
- The approach shows potential for scaling to verify large software systems with provable correctness guarantees
Source: Next Steps in LLM-Supported Java Verification
Student-LLM Interaction Study: Software Engineering Education Analysis
A study analyzing how 126 undergraduate students interacted with AI assistants during a 13-week software engineering course, examining conversations, code generation, and integration patterns.
- Students showed a preference for ChatGPT over Copilot, with ChatGPT generating computationally less complex code.
- Conversational interactions with LLMs produced higher-quality code than auto-generated solutions.
- Analysis covered student conversations, generated code, code utilization rates, and the level of human intervention needed for codebase integration.
Source: Analysis of Student-LLM Interaction in a Software Engineering Project
LLMs vs Human Experts in Requirements Engineering: A Comparative Study
A study comparing LLMs and human experts in requirements elicitation reveals significant advantages in speed, cost, and quality of LLM-generated requirements.
- LLM-generated requirements showed higher alignment (+1.12) and better completeness (+10.2%) compared to human-generated requirements
- Performance metrics showed LLMs working 720 times faster than human experts, at only 0.06% of the cost
- Users displayed a bias towards human authorship, attributing better-aligned solutions to human experts despite contrary evidence
Source: Analysis of LLMs vs Human Experts in Requirements Engineering
LLM Multi-Agent System: Autonomous Legacy Web Application Upgrades
A system that autonomously upgrades legacy web applications using multiple LLM agents working in coordination, distributing tasks across different upgrade phases.
- Zero-shot and one-shot prompts were used to evaluate the system's effectiveness in updating view files and meeting complex requirements.
- Multiple agents maintain context across tasks, improving solution quality compared to standalone LLM execution in certain scenarios.
- Results showed high precision in updating small outdated files, even with basic prompts.
- Source code is available at: https://github.com/alasalm1/Multi-agent-pipeline
Source: Autonomous Legacy Web Application Upgrades Using a Multi-Agent System
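The coordination pattern is roughly a pipeline with shared state. A minimal sketch, with roles and phases assumed by us rather than taken from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class UpgradeContext:
    files: dict[str, str]
    notes: list[str] = field(default_factory=list)

def analyzer(ctx: UpgradeContext) -> UpgradeContext:
    ctx.notes.append("index.html uses deprecated template syntax")
    return ctx

def updater(ctx: UpgradeContext) -> UpgradeContext:
    # An LLM call would rewrite each flagged file here; ctx.notes carries
    # the cross-task context credited for better solution quality.
    ctx.files["index.html"] = "<!-- upgraded markup -->"
    return ctx

def verifier(ctx: UpgradeContext) -> UpgradeContext:
    ctx.notes.append("all updated files parse cleanly")
    return ctx

ctx = UpgradeContext(files={"index.html": "<!-- legacy markup -->"})
for agent in (analyzer, updater, verifier):
    ctx = agent(ctx)
print(ctx.notes)
```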
CASEY: Security Vulnerability Triage Using LLMs
A framework that automates the identification of Common Weakness Enumerations (CWEs) and severity assessment of security bugs using LLMs.
- The system uses prompt engineering and contextual information at multiple granularity levels to analyze security vulnerabilities.
- Performance testing on the National Vulnerability Database showed 68% accuracy for CWE identification and 73.6% for severity assessment.
- Combined accuracy for identifying both CWE and severity levels reached 51.2%, demonstrating potential for streamlining vulnerability management workflows.
Source: Streamlining Security Vulnerability Triage with Large Language Models
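A sketch of the multi-granularity prompting idea; the prompt text and field names are our assumptions, not CASEY's:

```python
def triage_prompt(bug_report: str, function_src: str = "",
                  file_summary: str = "") -> str:
    """Widen context step by step: report only, plus function, plus file."""
    context = "\n\n".join(filter(None, [
        f"Bug report:\n{bug_report}",
        f"Affected function:\n{function_src}" if function_src else "",
        f"File summary:\n{file_summary}" if file_summary else "",
    ]))
    return (f"{context}\n\nIdentify the CWE ID and a severity rating "
            "(LOW/MEDIUM/HIGH/CRITICAL). Answer as JSON: "
            '{"cwe": "...", "severity": "..."}')

print(triage_prompt("strcpy into a fixed-size buffer from network input"))
```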
RepoAudit: Autonomous Repository-Level Code Auditing Using LLM Agents
An LLM-powered autonomous agent system that performs efficient and precise code auditing at the repository level, analyzing data-flow and program paths to identify bugs.
- Built-in agent memory allows exploration of code repositories on demand, focusing on data-flow analysis along feasible program paths in functions.
- A validator component mitigates hallucinations by verifying data-flow facts and checking path conditions of potential bugs to reduce false positives.
- Tests with Claude 3.5 Sonnet discovered 38 true bugs across 15 real-world systems, with an average processing time of 0.44 hours and cost of $2.54 per project.
Source: RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing
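The validator is the part that keeps hallucinations out of the report. A toy rendering of the idea (ours, not RepoAudit's code): re-check that every data-flow step the model claims actually exists in the source before accepting the bug.

```python
import ast

def assignments(source: str) -> set[tuple[str, str]]:
    """Collect (target, origin) pairs for simple name-to-name assignments."""
    pairs = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Name):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    pairs.add((target.id, node.value.id))
    return pairs

def validate_flow(claimed: list[tuple[str, str]], source: str) -> bool:
    """Accept a reported flow only if every claimed step is real."""
    return all(step in assignments(source) for step in claimed)

src = "tainted = user_input\nquery = tainted\n"
print(validate_flow([("tainted", "user_input"), ("query", "tainted")], src))  # True
print(validate_flow([("query", "sanitized")], src))                           # False
```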
AugmenTest: Test Oracle Generation Using LLMs and Documentation
A framework that generates test oracles by inferring correct behavior from software documentation using LLMs, addressing a key challenge in automated test generation.
- The system interprets method behavior from documentation and developer comments, rather than analyzing code directly
- Four variants are available: Simple Prompt, Extended Prompt, RAG with generic prompt, and RAG with Simple Prompt, each providing different levels of contextual information
- Evaluation on 142 Java classes showed the Extended Prompt variant achieved a 30% success rate in generating correct assertions, significantly outperforming the current state-of-the-art TOGA's 8.2%
- RAG-based approaches performed below expectations with an 18.2% success rate in the most conservative testing scenario
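To picture what oracle inference from documentation looks like, here is a small sketch with our own prompt wording; the model sees the documented contract rather than the implementation:

```python
def oracle_prompt(doc: str, signature: str, test_prefix: str) -> str:
    return (f"Method: {signature}\n"
            f"Documentation: {doc}\n"
            f"Test so far:\n{test_prefix}\n"
            "Complete the test with a single assertion that checks the "
            "documented behavior.")

print(oracle_prompt(
    doc="Returns the absolute value; the result is never negative.",
    signature="int abs(int x)",
    test_prefix="int result = abs(-5);",
))
```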