[AI Dev Tools] LLM-Powered Code Review, Automated README Generation, GitHub Copilot Enhancement ...
![[AI Dev Tools] LLM-Powered Code Review, Automated README Generation, GitHub Copilot Enhancement ...](/content/images/size/w960/2024/11/asset_list.png)
README Generator CLI: Automated README Creation with LLMs
README Generator CLI is a command-line tool that automatically creates comprehensive README.md files for projects using an LLM, specifically Google's Gemini API.
Key Features:
- Generates complete README files from project structure and content, leveraging Gemini's language processing capabilities.
- Flexible usage options, including generating READMEs for the current directory or specified project locations.
- Customizable output, allowing users to specify file names and locations for generated content.
- Detailed logging and error handling to assist with troubleshooting and debugging.
- Supports configuration through environment variables, enabling easy customization of API keys and default settings.
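The core idea is straightforward: collect the project's structure and file contents, hand them to Gemini, and ask for a README. Below is a minimal sketch of that flow, not the tool's actual implementation; the model name, file filters, and prompt are assumptions, and it presumes the google-generativeai SDK plus a GEMINI_API_KEY environment variable.

```python
import os
from pathlib import Path

import google.generativeai as genai  # pip install google-generativeai

# Assumptions: GEMINI_API_KEY is set; model name, file filters, and prompt are
# illustrative, not the tool's actual defaults.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")


def collect_project_context(root: str, max_bytes: int = 4000) -> str:
    """Gather file paths plus a snippet of each source file as LLM context."""
    parts = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in {".py", ".md", ".toml", ".cfg"}:
            snippet = path.read_text(errors="ignore")[:max_bytes]
            parts.append(f"### {path}\n{snippet}")
    return "\n\n".join(parts)


context = collect_project_context(".")
prompt = (
    "Write a comprehensive README.md (overview, installation, usage, license) "
    "for the project below:\n\n" + context
)
readme = model.generate_content(prompt).text
Path("README.md").write_text(readme)
print("Wrote README.md")
```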
Open Notebook: Privacy-Focused Alternative to Google's Notebook LM
Open Notebook is an open-source tool that empowers users to manage research, generate AI-assisted notes, and interact with content while maintaining privacy and control over their data.
Key Features:
- Supports multiple notebooks for organized research across various topics.
- Integrates with various content types including links, PDFs, EPUB, Office files, YouTube videos, audio, and video files.
- Offers AI-assisted note generation and insights.
- Includes built-in full-text and vector search for efficient information retrieval.
- Provides fine-grained context management, allowing users to control what information is shared with the LLM.
- Features a podcast generator to convert notes into audio format.
- Implements a transformation system for extracting custom insights from content.
- Supports multiple LLM providers including OpenAI, Anthropic, Gemini, Vertex AI, Open Router, and Ollama.
Gemini AI Code Reviewer: Automated PR Reviews with Google's Gemini AI
A GitHub Action that automatically reviews pull requests using Google's Gemini AI, providing comments and suggestions to improve source code.
Key Features:
- Integrates with GitHub Actions to review pull requests using the Gemini API.
- Analyzes code changes in pull requests, excluding specified file types.
- Generates and posts AI-powered review comments directly on GitHub PRs.
- Supports customization of the Gemini model used for reviews.
- Triggered by commenting "/gemini-review" on a pull request.
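The review loop boils down to three steps: fetch the pull request diff, ask Gemini for feedback, and post the result back as a comment. The sketch below illustrates that flow from plain Python rather than the action's own code; the repository name, PR number, and model are placeholders, and it assumes GITHUB_TOKEN and GEMINI_API_KEY are available.

```python
import os

import requests
import google.generativeai as genai  # pip install google-generativeai

# Placeholders / assumptions: REPO and PR_NUMBER are illustrative; GITHUB_TOKEN
# and GEMINI_API_KEY must be set. This is a sketch of the flow, not the action.
REPO = "owner/repo"
PR_NUMBER = 42
gh_headers = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github.v3.diff",  # ask GitHub for the raw diff
}

# 1. Fetch the pull request diff.
diff = requests.get(
    f"https://api.github.com/repos/{REPO}/pulls/{PR_NUMBER}", headers=gh_headers
).text

# 2. Ask Gemini for review feedback on the diff.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")
review = model.generate_content(
    "Review this diff and suggest concrete improvements:\n\n" + diff
).text

# 3. Post the feedback back to the pull request as a comment.
requests.post(
    f"https://api.github.com/repos/{REPO}/issues/{PR_NUMBER}/comments",
    headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
    json={"body": review},
).raise_for_status()
```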
llama.vim: Local LLM-assisted Text Completion for Vim
llama.vim is a Vim plugin that provides local LLM-assisted text completion, offering auto-suggestions and performance stats directly in the editor.
Key Features:
- Auto-suggest on cursor movement in Insert mode, with manual toggle and easy acceptance options.
- Supports large contexts even on low-end hardware through smart context reuse.
- Configurable context scope, including chunks from open files, edited files, and yanked text.
- Displays performance stats for generated suggestions.
- Requires a running llama.cpp server instance, with recommended settings for different VRAM capacities.
- Compatible with FIM-compatible models, with a collection available on Hugging Face.
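llama.vim talks to the llama.cpp server over HTTP to obtain fill-in-the-middle (FIM) completions. The sketch below shows the same kind of request issued from Python rather than from the plugin; the endpoint, port, and field names are assumptions based on the llama.cpp server's infill API and should be checked against your server version.

```python
import requests

# Assumptions: a llama.cpp server is running locally (adjust host/port to your
# setup) and exposes the /infill endpoint used for fill-in-the-middle
# completion; field names may differ between server versions.
resp = requests.post(
    "http://127.0.0.1:8080/infill",
    json={
        "input_prefix": "def fibonacci(n):\n    ",   # code before the cursor
        "input_suffix": "\n\nprint(fibonacci(10))",  # code after the cursor
        "n_predict": 64,                             # max tokens to generate
    },
    timeout=30,
)
print(resp.json().get("content", ""))
```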
@llamaindex/chat-ui: React Component Library for LLM Chat Interfaces
@llamaindex/chat-ui is a React component library providing pre-built UI elements for creating chat interfaces in LLM applications, streamlining the development process.
Key Features:
- Pre-built chat components including message bubbles and input fields, with minimal styling for easy customization using Tailwind CSS.
- Supports custom widgets for extending components, such as rendering generated or retrieved documents.
- TypeScript support and seamless integration with LLM backends.
- Code and LaTeX styling using highlight.js and KaTeX, with PDF viewing capabilities.
- Composable components allowing for easy customization and extension with user-defined elements.
- Utilizes the useChatUI hook for sending additional data to chat API endpoints, enhancing flexibility.
Jarvis: Command-Line Personal Assistant for Google Services
Jarvis is a command-line personal assistant that integrates with Google Calendar, Gmail, and Tasks to help manage your digital life.
Key Features:
- Integrates with Gmail to view unread emails, Google Calendar to check upcoming events, and Google Tasks for task tracking.
- Uses the OpenRouter.ai API for natural language processing, with support for a local LLM API as an alternative.
- Provides a streamlined setup process with a start script and automatic token file creation for Google APIs.
- Responds to user queries by polling relevant Google APIs based on message content, providing contextual information.
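To make the "poll the relevant Google API" idea concrete, here is a generic Gmail polling sketch, not Jarvis's actual code: it lists unread messages and prints their subjects. It assumes an OAuth client file (credentials.json) downloaded from the Google Cloud Console and the standard Google API Python libraries.

```python
from google_auth_oauthlib.flow import InstalledAppFlow  # pip install google-auth-oauthlib
from googleapiclient.discovery import build             # pip install google-api-python-client

# Assumptions: credentials.json is an OAuth client from the Google Cloud
# Console; this illustrates Gmail polling in general, not Jarvis's internals.
SCOPES = ["https://www.googleapis.com/auth/gmail.readonly"]

flow = InstalledAppFlow.from_client_secrets_file("credentials.json", SCOPES)
creds = flow.run_local_server(port=0)
gmail = build("gmail", "v1", credentials=creds)

# List the five most recent unread messages and print their subjects.
msgs = gmail.users().messages().list(userId="me", q="is:unread", maxResults=5).execute()
for ref in msgs.get("messages", []):
    msg = gmail.users().messages().get(
        userId="me", id=ref["id"], format="metadata", metadataHeaders=["Subject"]
    ).execute()
    headers = msg["payload"]["headers"]
    subject = next((h["value"] for h in headers if h["name"] == "Subject"), "(no subject)")
    print(subject)
```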
Codai: AI-Powered Code Assistant for Developers
Codai is an AI code assistant that helps developers manage daily tasks through a session-based CLI, offering context-aware suggestions for code improvement and generation.
Key Features:
- Uses RAG (Retrieval-Augmented Generation) to improve code suggestions by embedding and retrieving relevant information based on user input.
- Summarizes the full project context with Tree-sitter, sending only code signatures to the LLM rather than full implementations (see the sketch after this list).
- Supports multiple LLM models, including GPT-4o, GPT-4, and Ollama, with compatibility for both cloud-based and local models.
- Provides intelligent code suggestions, assists with adding new features, refactoring, bug resolution, and code review.
- Allows direct acceptance and application of code changes to the codebase.
- Generates documentation automatically and supports multiple programming languages.
- Offers customizable configuration through a config.yml file or environment variables.
- Implements token management to track consumption for each request, aiding in API cost management.
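The "signatures only" trick keeps prompts small: rather than sending whole files, you send just the declarations. Codai does this across languages with Tree-sitter; the sketch below illustrates the same idea for Python files using only the standard-library ast module, as an illustration rather than Codai's code. The file name app.py is a placeholder.

```python
import ast
from pathlib import Path


def signatures(path: str) -> list[str]:
    """Return 'def name(args)' signatures from a Python file, bodies omitted."""
    tree = ast.parse(Path(path).read_text())
    sigs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            sigs.append(f"def {node.name}({args})")
    return sigs


# Example: summarize a file's API surface for inclusion in an LLM prompt.
print("\n".join(signatures("app.py")))
```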
llm-jq: LLM-Assisted jq Program Generation and Execution
llm-jq is a plugin for LLM that allows users to write and execute jq programs with the assistance of large language models.
Key Features:
- Generates and executes jq programs based on natural language descriptions of desired JSON transformations.
- Integrates seamlessly with the LLM command-line tool, allowing for easy installation and use.
- Offers options for silent operation, outputting only the jq program, verbose mode, and customizing the input length for the model.
- Supports piping JSON data directly into the command, making it convenient for processing API responses or other JSON sources.
- Example use: curl -s https://api.github.com/repos/simonw/datasette/issues | llm jq 'count by user.login, top 3' generates a jq program to count and rank GitHub issue creators.
ask.py: AI-Powered Web Search and Summarization Tool
ask.py is a Python program that implements a search-extract-summarize workflow, similar to AI search engines like Perplexity, allowing users to perform web searches and generate summarized answers.
Key Features:
- Performs Google searches, scrapes web pages, and uses vector search to find relevant content chunks.
- Utilizes LLMs to generate answers based on the retrieved information.
- Offers both a command-line interface and a Gradio web UI for flexible usage.
- Allows customization of search behavior, including date restrictions and site-specific searches.
- Supports various output modes, including answer generation and structured data extraction.
- Can process a predefined list of URLs instead of performing web searches.
- Integrates with DuckDB for vector and full-text search capabilities.
- Easily deployable to HuggingFace Spaces for sharing and collaboration.
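The extract-and-summarize half of this workflow is easy to picture in a few lines. Below is a minimal sketch under assumptions, not ask.py's implementation (which also performs the Google search and uses DuckDB for retrieval): it fetches a predefined list of URLs, chunks the text, ranks chunks against the question with OpenAI embeddings, and asks a chat model to answer from the top chunks. The URLs, question, and model names are placeholders, and OPENAI_API_KEY must be set.

```python
import re

import requests
from openai import OpenAI  # pip install openai

# Assumptions: OPENAI_API_KEY is set; URLS, QUESTION, and model names are placeholders.
client = OpenAI()
URLS = ["https://example.com/article-1", "https://example.com/article-2"]
QUESTION = "What are the key findings?"


def fetch_text(url: str) -> str:
    html = requests.get(url, timeout=10).text
    return re.sub(r"<[^>]+>", " ", html)  # crude tag stripping, good enough for a sketch


# 1. Scrape pages and split them into fixed-size chunks.
chunks = []
for url in URLS:
    text = fetch_text(url)
    chunks += [text[i:i + 1500] for i in range(0, len(text), 1500)]


# 2. Embed chunks and the question, then rank chunks by cosine similarity.
def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm


chunk_vecs = embed(chunks)
q_vec = embed([QUESTION])[0]
top = sorted(zip(chunks, chunk_vecs), key=lambda cv: cosine(q_vec, cv[1]), reverse=True)[:5]

# 3. Ask the LLM to answer using only the retrieved chunks.
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "Answer the question using only this context:\n\n"
                          + "\n---\n".join(c for c, _ in top)
                          + f"\n\nQuestion: {QUESTION}"}],
).choices[0].message.content
print(answer)
```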
askrepo: Source Code Analysis with LLMs
askrepo is a tool that analyzes Git-managed source code using Google's Gemini API, providing answers to user-specified questions about the codebase.
Key Features:
- Reads content from Git-tracked text files in a specified directory, filtering out binary files.
- Sends formatted code content to the Google Gemini API along with a user-defined prompt.
- Provides flexible command-line options for customizing the analysis, including choice of Gemini model and custom prompts.
- Integrates with Git to focus on relevant, tracked files within a project.
- Example use: askrepo can be used to explain code functionality, find potential bugs, or answer specific questions about a codebase. For instance: askrepo --prompt "What is the purpose of this code?" ../your-repo/src
Flexpilot AI: Open-Source AI Assistant for VS Code
Flexpilot AI is an open-source VS Code extension that provides flexible AI-powered development assistance, allowing users to integrate their preferred AI providers and models directly into their coding environment.
Key Features:
- Native VS Code integration for a seamless coding experience without webviews
- Supports multiple AI providers, including OpenAI, Anthropic, Google Gemini, and others, using the user's own API keys
- AI-powered code completions offer context-aware suggestions and natural language guidance
- Panel Chat and Inline Chat features for interactive AI conversations within the workspace
- Quick Chat provides instant answers with a single shortcut, maintaining workflow continuity
- Smart Variables feature references code elements for more tailored assistance
- Voice Chat enables hands-free coding with spoken questions and real-time code suggestions
- Generates AI-powered commit messages and PR descriptions for clearer code contributions
- Token Usage Insights provide transparency on AI interaction consumption
- Open-source under GNU GPLv3 license, encouraging community contributions and customization
Lumen: AI-Powered Git Commit and Diff Summarization Tool
Lumen is a command-line tool that utilizes AI to generate commit messages, summarize git diffs, and provide insights into code changes without requiring an API key.
Key Features:
- Generates commit messages for staged changes and summaries for specific commits or diffs.
- Allows users to ask questions about specific changes, enhancing understanding of code modifications.
- Offers fuzzy-search functionality for commit summary generation using fzf.
- Operates without an API key, providing free and unlimited usage out of the box.
- Supports multiple AI providers, including Phind, OpenAI, Groq, Claude, Ollama, and OpenRouter.
- Provides pretty output formatting using Markdown and mdcat.
- Allows customization through a configuration file, including commit types and provider settings.
Promptim: Automated Prompt Optimization for AI Systems
Promptim is an experimental library that automates the process of improving prompts for AI tasks through systematic optimization.
Key Features:
- Automates prompt improvement by running optimization loops on specific tasks using provided datasets and custom evaluators.
- Supports human-in-the-loop feedback through an annotation queue interface for guided optimization.
- Utilizes a metaprompt approach to suggest incremental changes to the current prompt based on performance metrics.
- Provides a CLI for easy task creation, configuration, and training initiation.
- Allows customization of the optimization process through configuration files and command-line arguments.
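A toy version of the metaprompt loop that Promptim automates is shown below. This is not Promptim's API: the dataset, evaluator, prompts, and model name are all placeholders, and it assumes the openai SDK with OPENAI_API_KEY set. The loop scores the current prompt on a tiny dataset, asks a "metaprompt" to propose an incremental revision, and keeps the candidate only if it scores better.

```python
from openai import OpenAI  # pip install openai

# Not Promptim's API: a toy metaprompt optimization loop under stated assumptions.
client = OpenAI()

dataset = [  # (input, expected answer) -- placeholder training examples
    ("2 + 2", "4"),
    ("the capital of France", "Paris"),
]


def run(system_prompt: str, question: str) -> str:
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": question}],
    )
    return out.choices[0].message.content.strip()


def score(prompt: str) -> float:
    """Fraction of dataset examples answered correctly -- the custom evaluator."""
    return sum(expected.lower() in run(prompt, q).lower() for q, expected in dataset) / len(dataset)


prompt = "You are a helpful assistant."
best = score(prompt)
for _ in range(3):  # optimization loop
    # Ask a metaprompt to propose an incremental improvement to the current prompt.
    candidate = run(
        "You improve system prompts. Return only the revised prompt.",
        f"Current prompt:\n{prompt}\n\nScore: {best:.2f}. Suggest a better prompt "
        "that makes answers short and exact.",
    )
    if (s := score(candidate)) > best:
        prompt, best = candidate, s
print(best, prompt)
```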
Ditto: Simple Self-Building Coding Agent for Flask Applications
Ditto is a tool that generates multi-file Flask applications from natural language descriptions using a no-code interface and LLM technology.
Key Features:
- Generates Flask applications from simple natural language input, automating the coding process for routes, templates, and static files.
- Self-building agent plans and constructs the application without manual coding, organizing code into a clean, modular structure.
- Web interface allows users to describe their desired application and monitor the generation progress in real-time.
- Requires Python 3.7+ and an OpenAI API key for operation. Installation involves cloning the repository and installing dependencies.
- Example use: Users can create a Flask application by describing it in plain English through the web interface at http://localhost:8080.
Integuru: AI-Powered Integration Code Generator
Integuru is an AI agent that automates the creation of integration code by reverse-engineering internal APIs of various platforms.
Key Features:
- Generates a dependency graph of API requests to perform desired actions on external platforms.
- Creates runnable Python code that interacts with internal platform endpoints.
- Supports input variables for flexible graph generation (code generation support coming soon).
- Utilizes OpenAI's GPT-4o for graph generation and o1-preview for code generation.
- Handles complex authentication processes, including two-factor authentication (2FA).
CoqPilot: VS Code Extension for Automated Coq Proof Generation
CoqPilot is a VS Code extension that automates the writing of Coq proofs by combining LLMs and non-machine-learning methods to generate and validate proof candidates for incomplete proofs.
- The plugin identifies proof holes marked with the "admit" tactic in Coq files and generates proof candidates to fill these gaps.
- CoqPilot checks the validity of each generated proof candidate and replaces successful ones in the original file.
- A key feature is its zero-setup experience, allowing users to seamlessly combine multiple Coq generation approaches.
- The tool also serves as a platform for LLM-based experiments on Coq proof generation, including a built-in benchmarking system for various generation methods.
Source: CoqPilot, a plugin for LLM-based generation of proofs
SmartGSN: AI-Powered Assurance Case Management Tool
SmartGSN is an online tool that uses LLMs to (semi-)automate the management of assurance cases complying with GSN notation, helping producers of mission-critical systems demonstrate compliance with industry standards.
- Assurance cases are crucial for demonstrating compliance with industry standards to regulatory authorities, helping prevent system failures in mission-critical systems.
- The tool utilizes LLMs to detect assurance case patterns within manually created cases across various application domains.
- Evaluation of SmartGSN shows strong capability in identifying patterns in assurance cases from five different systems.
- SmartGSN is accessible online at https://smartgsn.vercel.app, with a demonstration video available at https://youtu.be/qLrTHf-SZbM.
Source: SmartGSN: a generative AI-powered online tool for the management of assurance cases
CLDK: Program Analysis Framework for Code LLMs
CLDK (codellm-devkit) is an open-source Python library that simplifies program analysis for different programming languages to support code LLM use cases.
- The framework addresses the challenge of providing code-specific contextual information to LLMs, which is typically derived from program analysis tools.
- CLDK offers an intuitive interface for performing program analysis at various levels of granularity across different programming languages.
- It enables developers to easily integrate detailed code insights, enhancing the efficiency and effectiveness of LLMs in coding tasks.
- The library aims to bridge the gap between language-specific static analysis tools and their integration with code LLMs.
- CLDK is available on GitHub, providing an accessible solution for developers looking to leverage program analysis in conjunction with code LLMs.
Source: Codellm-Devkit: A Framework for Contextualizing Code LLMs with Program Analysis Insights
GIS Copilot: Integrating LLMs into GIS Platforms for Autonomous Spatial Analysis
GIS Copilot is a framework that integrates LLMs into existing GIS platforms, enabling users to perform spatial analysis using natural language commands.
- The framework leverages LLMs' reasoning and programming capabilities to generate spatial analysis workflows and code autonomously.
- Implementation focused on QGIS, creating a "GIS Copilot" that allows users to interact with the platform using natural language.
- Evaluation involved over 100 spatial analysis tasks of varying complexity, from basic operations to advanced multi-step processes.
- Results showed high success rates in tool selection and code generation for basic and intermediate tasks, with challenges remaining for more complex operations.
- The study contributes to the vision of Autonomous GIS, potentially simplifying workflows and enabling non-experts to engage in geospatial analysis with minimal prior knowledge.
Source: GIS Copilot: Towards an Autonomous GIS Agent for Spatial Analysis
PyGen: Automated Python Package Creation from User Prompts
PyGen is an automation platform that generates Python packages from user-provided abstract ideas, leveraging large language models to streamline the software development process.
- The platform combines autoregressive LLMs with open-source code generation technologies to create complete Python packages, including documentation.
- PyGen employs a prompt enhancement approach, refining user descriptions into specific, actionable instructions for package creation.
- Evaluation of generated packages and documentation utilized Human Evaluation, LLM-based assessment, and CodeBLEU metrics.
- The tool aims to enhance researcher productivity by enabling the creation of resilient, modular, and well-documented packages for specialized purposes.
- PyGen's code and generated examples are available on GitHub, promoting collaborative development and accessibility in scientific and technological advancement.
Source: PyGen: A Collaborative Human-AI Approach to Python Package Creation
CodeScribe: AI-Assisted Code Translation and Development for Scientific Computing
CodeScribe is a tool that combines prompt engineering with user supervision to facilitate efficient code conversion and development in scientific computing workflows.
- The tool assists in converting Fortran code to C++, addressing the need for modernizing legacy codebases in scientific computing.
- CodeScribe generates Fortran-C APIs, enabling integration of legacy systems with modern C++ libraries.
- It provides developer support for code organization and algorithm implementation, enhancing productivity in scientific workflows.
- The tool was developed and tested on a legacy Fortran codebase used for simulating particle interactions at the Large Hadron Collider.
- While leveraging generative AI for code translation and development, CodeScribe acknowledges the need for manual intervention and incorporates user supervision to ensure correctness.
AlphaTrans: Repository-Level Code Translation and Validation
AlphaTrans is a neuro-symbolic approach for automating repository-level code translation, addressing the challenges of translating complex, real-world projects with dependencies and language-specific features.
- The system translates both source and test code, employing multiple levels of validation to ensure functionality preservation.
- Program analysis is used to decompose programs into fragments, which are then translated in reverse call order to manage complexity for LLMs.
- In tests on ten open-source projects, AlphaTrans translated 6899 code fragments with 99.1% syntactic correctness and validated runtime behavior and functional correctness for 25.8%.
- The translation and validation process averaged 36 hours per project, demonstrating practical scalability.
- For incorrect translations, AlphaTrans generates detailed reports, enabling developers to efficiently fix translation bugs and achieve passing tests.
Source: Repository-Level Compositional Code Translation and Validation
Improving ChatGPT-Based Abbreviation Expansion in Source Code
A study on using ChatGPT for expanding abbreviations in source code identifiers, addressing initial performance issues and proposing improvements.
- Initial evaluation showed ChatGPT significantly underperformed compared to state-of-the-art methods, with 28.2% lower precision and 27.8% lower recall on a public benchmark.
- Root causes for poor performance: lack of context and inability to recognize abbreviations.
- Improvements included providing surrounding source code as context and implementing an iterative approach to identify and mark missed abbreviations in prompts.
- A post-condition check was added to exclude expansions violating common sense.
- These enhancements brought ChatGPT-based abbreviation expansion to a level comparable with state-of-the-art approaches, without requiring expensive source code parsing and deep analysis.
Source: Evaluating and Improving ChatGPT-Based Expansion of Abbreviations
Automated Flaky Test Detection for Quantum Software
A framework for automating the detection of flaky tests in quantum software, addressing the challenges posed by their complexity and probabilistic nature.
- The framework expands on prior manual analysis of 14 quantum software repositories, using transformers and cosine similarity for automated detection.
- Experiments with LLMs from OpenAI GPT and Meta LLaMA families assessed their ability to detect and classify flaky tests from code and issue descriptions.
- Results showed promising outcomes: 25 new flaky tests were identified, expanding the dataset by 54%. Top LLMs achieved an F1-score of 0.8871 for flakiness detection.
- Root cause identification remains a challenge, with LLMs achieving only a 0.5839 F1-score for this task.
- Future work will focus on improving detection techniques and developing automatic fixes for flaky tests in large quantum codebases.
Source: Automating Quantum Software Maintenance: Flakiness Detection and Root Cause Analysis
Multi-Agent AI System Improves Software Development Tools
An experiment demonstrates enhanced performance when integrating Crowdbotics PRD AI and GitHub Copilot in a multi-agent configuration for software development.
- The study focuses on context sharing between two commercial AI tools: Crowdbotics PRD AI for generating software requirements and GitHub Copilot for AI pair-programming.
- By sharing business requirements from PRD AI with GitHub Copilot, the code suggestion capabilities improved by 13.8%.
- Developer task success rate increased by 24.5% when using the integrated multi-agent system.
- This real-world application demonstrates the potential of commercially available AI tools working together to enhance the software development lifecycle.
Source: Improving Performance of Commercially Available AI Products in a Multi-Agent Configuration
LLM-Based JSON Parser Fuzzing: Bug Discovery and Behavioral Analysis
A research project utilizing Large Language Models (LLMs) to enhance JSON parser testing, focusing on generating test cases and mutants for bug discovery and behavioral analysis.
- The project aims to uncover potential bugs in open-source JSON parsers through LLM-generated test cases and mutants.
- Identification of behavioral diversities among different JSON parsers is a key objective of the research.
- This approach builds on the success of fuzzing techniques in uncovering bugs and vulnerabilities across various software systems.
- The research emphasizes the importance of ensuring reliability in JSON parsers, which play a crucial role in modern software development.
Source: Large Language Models Based JSON Parser Fuzzing for Bug Discovery and Behavioral Analysis
M2RC-EVAL: Multilingual Repository-level Code Completion Benchmark
A benchmark for evaluating code LLMs' multilingual repository-level code completion abilities, covering 18 programming languages.
- M2RC-EVAL addresses limitations of existing benchmarks, which typically focus on fewer than 5 languages and report only average scores.
- The benchmark includes fine-grained annotations at bucket and semantic levels, based on parsed abstract syntax trees, to assess performance in various completion scenarios.
- Accompanying the benchmark is M2RC-INSTRUCT, a multilingual instruction dataset designed to enhance repository-level code completion capabilities of existing code LLMs.
- Experimental results demonstrate the effectiveness of both M2RC-EVAL and M2RC-INSTRUCT in improving and evaluating multilingual code completion.
Source: M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation
LLM-Generated Test Oracles: Actual vs Expected Behavior Study
A study investigating whether test oracles generated by Large Language Models (LLMs) capture actual or expected program behavior in software testing.
- The research compared LLM-generated test oracles with developer-written and automatically generated test cases for 24 open-source Java repositories.
- LLM-based test generation approaches tend to produce oracles that capture actual program behavior rather than expected behavior, similar to traditional test generation techniques like Randoop and Evosuite.
- LLMs perform better at generating test oracles than classifying correct ones. Their performance improves when code contains meaningful test or variable names.
- Despite limitations, LLM-generated test oracles demonstrate higher fault detection potential compared to those generated by Evosuite.
Source: Do LLMs generate test oracles that capture the actual or the expected program behaviour?
Challenges and Roadmap for Production-Ready Foundation Model Software
A comprehensive analysis of the challenges in transitioning foundation model (FM) software, including large language models (LLMs), from demonstrations to production-ready systems.
- FMware, software systems integrating FMs as core components, faces significant obstacles in production environments. These include reliability issues, high implementation costs, scalability challenges, and privacy regulation compliance.
- Key challenges identified in the FMware lifecycle: FM selection, data and model alignment, prompt engineering, agent orchestration, system testing, and deployment. Cross-cutting concerns include memory management, observability, and feedback integration.
- The analysis draws from industry experience, diverse data sources, and involvement in the Open Platform for Enterprise AI (OPEA) and FMware lifecycle engineering.
- The paper discusses necessary technologies and strategies to address these challenges, offering guidance for developing scalable, production-ready FMware solutions.
- Findings emphasize the importance of continued research and multi-industry collaboration to advance production-ready FMware development.
Source: From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap
APIs for Social Work Research: Leveraging LLMs and AI Services
A guide on using Application Programming Interfaces (APIs) to integrate Large Language Models (LLMs) and other AI services into social work research methodologies.
- APIs serve as essential tools for researchers to access advanced technologies like LLMs and AI services, enhancing research capabilities.
- The paper provides an overview of API functionality and integration into research workflows, addressing common barriers for those without programming experience.
- Practical code examples demonstrate how LLMs can generate API code for accessing specialized services, such as extracting data from unstructured text.
- The guide emphasizes data security, privacy considerations, and ethical concerns when using APIs in research contexts.
- By equipping researchers with these tools and knowledge, the paper aims to expand the impact of social work research through effective incorporation of AI technologies.
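As a concrete flavor of the kind of code example the guide describes, here is a small, purely illustrative sketch of extracting structured fields from an unstructured case note via an LLM API. It assumes the openai SDK and OPENAI_API_KEY; the note text, field names, and model are placeholders, and real client data should never be sent to an external API without consent and an appropriate data agreement.

```python
import json

from openai import OpenAI  # pip install openai

# Illustrative only: structured extraction from an unstructured note.
# Assumes OPENAI_API_KEY is set; do not send identifiable client data to an
# external API without consent and an appropriate data agreement.
client = OpenAI()

note = "Client reports housing instability since March and missed two appointments."

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # request machine-readable output
    messages=[{"role": "user",
               "content": 'Extract {"issues": [...], "missed_appointments": <int>} '
                          "as JSON from this note:\n" + note}],
)
print(json.loads(resp.choices[0].message.content))
```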
LLM-Enhanced Agile Model Driven Development for Code Generation
A research approach integrating LLMs like GPT-4 into Agile Model Driven Development (AMDD) for enhanced code auto-generation, addressing challenges in natural language ambiguity and improving flexibility in software development.
- The study proposes AMDD as a solution to overcome challenges in using LLMs for code generation, particularly addressing ambiguities in natural language descriptions of software.
- Researchers modeled a multi-agent Unmanned Vehicle Fleet system using UML, integrating Object Constraint Language (OCL) for code structure meta-modeling and FIPA ontology language for communication semantics meta-modeling.
- GPT-4's auto-generation capabilities were applied to produce Java and Python code compatible with JADE and PADE frameworks, respectively. The generated code aligned with expected behaviors and showed improvements in agent interactions.
- Evaluation of code complexity revealed that ontology-constrained meta-models produced more complex code, but cyclomatic complexity remained within manageable levels. This suggests the potential for incorporating additional meta-model constraints without exceeding high-risk complexity thresholds.
Source: LLM as a code generator in Agile Model Driven Development
MAD: LLM-Powered Smart Contract Decompiler for Sui Blockchain
MAD (Move AI Decompiler) is a web application that decompiles Sui blockchain smart contract bytecodes into human-readable, logically correct, and re-compilable source code using LLMs.
- The tool addresses the challenge of auditing non-open-source smart contracts on platforms like Sui, where source code is often unavailable.
- MAD's output successfully passes original unit tests and achieves a 66.7% recompilation success rate on real-world smart contracts.
- In a user study with 12 developers, MAD significantly reduced auditing workload compared to traditional decompilers, with outputs comparable to original source code.
- Despite occasional hallucinations and compile errors, MAD provides substantial improvements over traditional decompilers in smart contract logic comprehension and auditing.
- The application has potential implications for enhancing blockchain smart contract transparency, auditing, and education, with possible extensions to other smart contract languages like Solidity.
Collaborative AI Framework for Sentiment Analysis
A framework designed to efficiently distribute and resolve sentiment analysis tasks across various AI systems, addressing challenges in processing complex multimodal data and high-cost feature extraction.
- The framework leverages generative AI models like ChatGPT and Google Gemini to simplify complex sentiment analysis tasks into manageable, phased objectives.
- Developed to meet marketing-oriented software needs, the system integrates diverse AI models for processing multimodal data.
- A case study demonstrates the framework's effectiveness in analyzing sentiments across various online media channels using edge and cloud computing.
- This approach enables practical, widespread applications of sentiment analysis in industry settings, moving beyond specialized research environments.
MAGDA: LLM-Assisted Domain Modeling Tool
MAGDA is a user-friendly tool that leverages large language models (LLMs) and few-shot prompt learning to assist in domain modeling for model-driven engineering (MDE).
- The tool aims to overcome challenges in MDE, such as time constraints, incomplete domain understanding, and adherence to syntactic constraints.
- MAGDA's approach eliminates the need for extensive training on scarce domain-specific datasets, offering versatile support for various modeling activities.
- A user study was conducted to assess the real-world applicability of MAGDA in domain modeling, providing insights into its usability and effectiveness.
- The tool offers valuable recommendations to software modelers, potentially simplifying the software development process through abstraction.
Source: On the Utility of Domain Modeling Assistance with Large Language Models
Evaluating Software Development Agents: A Study of AI-Generated Patches on GitHub Issues
A comprehensive study evaluating 4,892 patches from 10 top-ranked software development agents on 500 real-world GitHub issues, focusing on their impact on code quality.
- No single agent dominated the evaluation, with 170 issues remaining unresolved, indicating room for improvement in AI-driven software development.
- Agent-generated patches that passed unit tests and resolved issues often differed from gold patches created by repository developers, revealing limitations in benchmark test case coverage.
- Most agents maintained code reliability and security, avoiding new bugs or vulnerabilities. Some increased code complexity, while many reduced code duplication and minimized code smells.
- Agents performed better on simpler codebases, suggesting that breaking complex tasks into smaller sub-tasks could improve their effectiveness.
- The study provides valuable insights for advancing AI-driven software development, particularly in the context of real-world GitHub scenarios.
LLM Flaw Reporting Best Practices: Lessons from OLMo Bug Bounty
A study of a bug bounty program for the Open Language Model (OLMo) reveals best practices for reporting flaws in LLMs, aimed at improving safety and reducing potential incidents.
- 495 hackers participated in an open-ended bug bounty for OLMo, conducted by The Allen Institute for AI in August 2024.
- A vendor panel from OLMo's safety program evaluated submissions, awarding cash bounties for demonstrations necessitating public disclosure updates.
- The study identifies best practices for safety reporting processes, artifacts, and safety program staffing in LLM development.
- Lessons learned focus on improving documentation clarity regarding model intent, capabilities, and potential hazards.
Source: To Err is AI : A Case Study Informing LLM Flaw Reporting Practices
LLM Integration in Software: Emerging Solutions for Quality Assurance
A study exploring solutions adopted by software developers to address challenges in integrating LLMs into products, based on interviews and surveys at Microsoft.
- The research utilized a mixed-method approach, including 26 interviews and a survey with 332 responses from Microsoft product teams.
- Unique characteristics of LLMs challenge traditional software development and evaluation assumptions, pushing developers out of their comfort zones.
- The study identified 19 emerging solutions focused on quality assurance for LLM-based products.
- Findings offer insights to guide the development and evaluation of software incorporating LLMs across various industries.
JSONTestGen: LLM-Driven Unit Test Generation for JSON Libraries
JSONTestGen is an approach that uses large language models (LLMs) to generate unit tests for JSON libraries, specifically targeting fastjson2, an open-source library from Alibaba.
- The method leverages LLMs' programming capabilities to create diverse test cases based on historical bug-triggering unit tests and JSON-specific mutation rules.
- Differential testing is employed to systematically identify potential bugs in the generated unit tests.
- JSONTestGen outperformed existing test generation tools in detecting unknown defects, uncovering 34 real bugs in fastjson2, with 30 already fixed.
- While LLM-generated tests can contain errors, particularly in assertions, the research suggests LLMs have potential for classifying false-positive test failures.
- The approach demonstrates promise for improving test oracle automation in JSON libraries, addressing critical issues like data inconsistencies and security vulnerabilities.
Source: Advancing Bug Detection in Fastjson2 with Large Language Models Driven Unit Test Generation
Adversarial Perturbations to Prevent LLM-assisted Cheating in Programming Courses
A study exploring methods to impede LLM-assisted cheating in introductory programming assignments through adversarial perturbations.
- The research investigated the performance of five widely used LLMs on introductory programming problems.
- Adversarial perturbations were examined as a means to degrade LLM performance in code generation.
- A user study revealed that combined perturbations reduced the average correctness score of LLM-generated code by 77%.
- The effectiveness of perturbations in reducing code correctness varied based on their detectability.
Source: Impeding LLM-assisted Cheating in Introductory Programming Assignments via Adversarial Perturbation
AutoRestTest: Multi-Agent Framework for REST API Testing
AutoRestTest is a black-box framework for REST API testing that employs a multi-agent approach, integrating Multi-Agent Reinforcement Learning (MARL), Semantic Property Dependency Graphs (SPDG), and Large Language Models (LLMs).
- The framework addresses limitations of existing black-box REST API testing tools by focusing on dependencies between test elements rather than treating them in isolation.
- Four collaborative agents (API, dependency, parameter, and value) optimize API exploration, with LLMs handling domain-specific value restrictions and SPDG simplifying the search space for dependencies.
- Evaluated on 12 real-world REST services, AutoRestTest outperformed leading black-box REST API testing tools in code coverage, operation coverage, and fault detection.
- The framework was the only tool able to identify an internal server error in Spotify, demonstrating its effectiveness in real-world scenarios.
- An ablation study confirmed the significant contributions of agent learning, SPDG, and LLM components to the framework's performance.
Source: A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs
RevMate: LLM-based Review Comment Generation in Open and Closed-Source Environments
A large-scale empirical user study evaluating the acceptance and impact of LLM-generated comments in code review processes at Mozilla and Ubisoft.
- RevMate, an LLM-based assistive tool, was integrated into the usual review environments of both organizations. It uses Retrieval Augmented Generation and LLM-as-a-Judge to generate and filter review comments.
- Acceptance rates for LLM-generated comments were 8.1% at Mozilla and 7.2% at Ubisoft. An additional 14.6% and 20.5% of comments, respectively, were marked as valuable review or development tips.
- Refactoring-related comments had higher acceptance rates (18.2% and 18.6%) compared to functional comments (4.8% and 5.2%).
- The median time spent by reviewers to inspect and edit generated comments was 43 seconds per patch, considered reasonable.
- Accepted LLM-generated comments were as likely to lead to future patch revisions as human-written comments (74% vs 73% at chunk-level).
Source: Impact of LLM-based Review Comment Generation in Practice: A Mixed Open-/Closed-source User Study
Time Curves: A Visual Approach to Analyzing Large-Scale Software System Logs
A pipeline combining clustering, LLM summarization, and Time Curves visualization to efficiently analyze and summarize vast collections of software system logs.
- The method addresses the challenges of analyzing large-scale software system logs, which are crucial for insights into system health, performance, and security but difficult to process manually due to their volume and variability.
- Clustering algorithms and LLMs, while useful, have limitations in processing large log collections efficiently or maintaining temporal and relational context.
- The proposed approach uses a semimetric distance to measure similarity between events, enabling meaningful representation through Time Curves visualization.
- This technique can explain main events in logs from different applications without prior knowledge, detect general trends, and identify outliers in parallel and distributed systems.
- The pipeline is expected to significantly reduce the time required for analyzing and resolving system-wide issues, identifying performance bottlenecks, and debugging applications.
Source: Analyzing Logs of Large-Scale Software Systems using Time Curves Visualization
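To make the Time Curves idea above more tangible, here is a toy illustration rather than the paper's pipeline: it vectorizes a handful of log events, computes pairwise distances (a crude stand-in for the paper's semimetric), projects them to 2D with MDS, and connects the points in time order to draw a "time curve". The log lines are invented, and it assumes scikit-learn and matplotlib.

```python
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_distances

# Toy illustration, not the paper's method: embed log events, compute pairwise
# distances, project to 2D, and connect points in timestamp order.
events = [
    "service started on port 8080",
    "db connection established",
    "request handled in 12 ms",
    "request handled in 13 ms",
    "db connection lost",
    "retrying db connection",
]

vectors = TfidfVectorizer().fit_transform(events)
distances = cosine_distances(vectors)                       # crude semimetric stand-in
coords = MDS(dissimilarity="precomputed", random_state=0).fit_transform(distances)

plt.plot(coords[:, 0], coords[:, 1], "-o")                  # the "time curve"
for i, (x, y) in enumerate(coords):
    plt.annotate(str(i), (x, y))                            # label points by time index
plt.title("Toy time curve over log events")
plt.show()
```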
LLMs in Automated Software Refactoring: An Empirical Study
A study investigating the potential of LLMs in automated software refactoring, focusing on identifying refactoring opportunities and recommending solutions.
- The research used a dataset of 180 real-world refactorings from 20 projects to evaluate ChatGPT and Gemini's performance.
- Initially, ChatGPT and Gemini identified only 28 and 7 refactoring opportunities, respectively. However, providing specific prompts increased ChatGPT's success rate from 15.6% to 86.7%.
- ChatGPT recommended solutions for 176 refactorings, with 63.6% comparable to or better than human expert solutions. However, some suggestions were unsafe, changing functionality or introducing syntax errors.
- To mitigate risks, the study proposes RefactoringMirror, a detect-and-reapply tactic using tested refactoring engines to validate LLM recommendations.
Source: An Empirical Study on the Potential of LLMs in Automated Software Refactoring
GUPPY: Automated Update of Android Deprecated API Usages with LLMs
GUPPY is an automated approach that leverages LLMs to update deprecated API usages in Android applications, ensuring compatibility across API levels.
- The system employs GPT-4 with carefully crafted prompts to update deprecated API usages, addressing the challenge of lingering outdated APIs in Android apps.
- GUPPY generates tests, identifies incorrect updates, and refines API usage through an iterative process until tests pass or a specified limit is reached.
- Evaluation on 360 benchmark API usages from 20 deprecated APIs and 156 deprecated API usages from API levels 33 and 34 demonstrates GUPPY's effectiveness over existing techniques.
Source: Automated Update of Android Deprecated API Usages with Large Language Models
AI-Generated Code Detection: Challenges and Improvements
A study examining the effectiveness of current AI-generated code detection tools and proposing new approaches to improve detection accuracy.
- Existing AI detection tools perform poorly in identifying AI-generated source code, lacking generalizability for practical deployment.
- The study proposes several methods to enhance detection performance, including fine-tuning LLMs and using machine learning-based classification with static code metrics or AST-generated code embeddings.
- The best-performing model developed in this study outperforms the current state-of-the-art detector (GPTSniffer), achieving an F1 score of 82.55.
- An ablation study was conducted to assess the impact of different source code features on the model's performance.
Source: An Empirical Study on Automatically Detecting AI-Generated Source Code: How Far Are We?
AI Assistants in Test Development: Impact and Evaluation
A study analyzing the impact of AI coding assistants on software test development, focusing on their ability to generate unit tests for open-source modules.
- Recent advancements in LLMs and AI-assisted coding tools have transformed software development, enabling the generation of complete programs with minimal human intervention.
- The research evaluates three popular AI coding assistants using the Test Pyramid concept, which categorizes tests into unit, integration, and end-to-end tests.
- Findings indicate that AI-generated tests are of comparable quality to original tests, with notable differences in usage and results among the tools.
- While AI assistants show promise in automated testing, thorough review and testing by developers remain crucial before deployment.
Source: Disrupting Test Development with AI Assistants
LLM-Based Test Oracle Generation from Javadocs
A framework for automating test oracle generation for Java library clients using LLMs and Javadocs, addressing the challenge of creating effective software tests.
- Test oracles, critical for validating code outputs, have been less explored in automation compared to test input generation.
- The approach leverages Javadocs of core Java libraries as a rich source of information for generating test oracles.
- LLMs are employed to interpret Javadocs and create executable test oracles for both normal and exceptional behaviors.
- Experimental results show high success rates: 98.8% of generated oracles are compilable, and 96.4% accurately reflect intended properties.
- Minor errors in the few incorrect oracles can be easily corrected using additional LLM-generated comment information.
DesignRepair: AI-Powered UI Design Quality Improvement
DesignRepair is a dual-stream system that examines and repairs UI design quality issues in LLM-generated interfaces, focusing on both code and rendered page aspects.
- The system utilizes Google's Material Design principles as a knowledge base, encoding them into low-level component and high-level system design knowledge bases.
- DesignRepair employs an LLM to extract key components and Playwright for precise page analysis, aligning these with the established knowledge bases.
- Retrieval-Augmented Generation with advanced LLMs like GPT-4 is used to holistically refine and repair frontend code through a divide-and-conquer approach.
- Evaluations demonstrate significant improvements in adherence to design guidelines, accessibility, and user experience metrics.
Source: DesignRepair: Dual-Stream Design Guideline-Aware Frontend Repair with Large Language Models
LLMs: Reshaping Software Development and Developer Roles
Large Language Models (LLMs) are transforming software engineering with their ability to generate human-like text, respond to complex queries, and write and interpret code.
- LLMs offer unprecedented opportunities for innovation and collaboration in software development, extending beyond traditional AI applications.
- These models are redefining the role of developers, with the potential to streamline workflows and enhance productivity.
- Early adoption of LLMs in software engineering is crucial for staying competitive in the rapidly evolving tech landscape.
- Challenges persist, necessitating a critical analysis of LLMs' technical strengths, limitations, and real-world applications.
- The paper serves as a guide for developers, organizations, and researchers on harnessing LLMs' power and acquiring necessary skills.
Source: LLMs: A Game-Changer for Software Engineers?
UX-LLM: AI-Powered Usability Issue Prediction for iOS Apps
UX-LLM is a tool that uses a Large Vision-Language Model to predict usability issues in iOS apps, aiming to complement traditional usability testing methods.
- The tool's performance was evaluated on two medium-complexity open-source apps, with precision ranging from 0.61 to 0.66 and recall between 0.35 and 0.38.
- Results indicate UX-LLM can identify valid usability issues but misses the majority of problems found through traditional methods.
- A focus group with an app development team found UX-LLM helpful in identifying unknown usability issues, though concerns about workflow integration were raised.
- While not a replacement for traditional usability evaluation, UX-LLM serves as a valuable supplement, particularly for small teams with limited resources.
- The tool's ability to inspect source code makes it useful for identifying issues in less common user paths.
Source: Does GenAI Make Usability Testing Obsolete?
LLM-based Application Development: Practitioner Insights
A study analyzing practitioner discussions on developing and deploying large language model (LLM) applications in production, providing key insights for software developers.
- The research examined 189 videos from 2022 to 2024, focusing on practitioners actively developing LLM-based systems.
- Analysis revealed 20 topics across 8 themes, with Design & Architecture emerging as the most prevalent theme. Retrieval-augmented generation (RAG) systems were a particular focus within this category.
- Other frequently discussed topics included model capabilities, enhancement techniques (e.g., fine-tuning, prompt engineering), infrastructure and tooling, and risks and ethical challenges.
- The study offers a systematic overview of key aspects practitioners should consider when developing LLM-based applications, highlighting current challenges and areas for further academic research.
Source: Practitioners' Discussions on Building LLM-based Applications for Production
Guidelines for Empirical Studies Involving LLMs in Software Engineering Research
A set of guidelines for conducting and assessing empirical studies that use LLMs in software engineering research, either as part of the research process or for evaluating LLM-based tools.
- The rapid adoption of LLMs in software engineering research since ChatGPT's release has created a need for rigorous empirical evaluation standards.
- Currently, no specific guidelines exist for studies involving LLMs in software engineering research, highlighting a gap in the field.
- The paper aims to initiate a discussion within the software engineering research community to establish common standards for high-quality LLM-related studies.
- These guidelines cover studies that use LLMs as part of the research process (e.g., for data annotation) and those evaluating existing or new LLM-based tools.
Source: Towards Evaluation Guidelines for Empirical Studies involving LLMs
Software Performance Engineering for Foundation Model-Powered Software
A paper highlighting the importance of Software Performance Engineering (SPE) in Foundation Model-powered software (FMware), identifying key challenges and discussing current practices.
- FMware, software powered by Foundation Models like LLMs, requires complex engineering across various domains to become production-ready.
- Performance engineering is critical for FMware to meet throughput and latency goals, avoiding user dissatisfaction and financial loss.
- Four key SPE challenges for FMware: cognitive architecture design, communication protocols, tuning and optimization, and deployment.
- Continuous performance engineering is essential to prevent degradation and ensure efficient hardware use, given FMware's high computational demands.
- The paper discusses problems, current practices, and innovative paths for the software engineering community in addressing these challenges.
Source: Software Performance Engineering for Foundation Model-Powered Software (FMware)
Evaluation of AI Programming Assistants: ChatGPT, Gemini, AlphaCode, and GitHub Copilot
A comprehensive study evaluating the performance of leading AI programming assistants in natural language processing and code generation tasks across multiple programming languages.
- The evaluation focuses on ChatGPT, Gemini (Bard AI), AlphaCode, and GitHub Copilot, assessing their accuracy in Java, Python, and C++ code generation.
- Results highlight the strengths and weaknesses of each model, emphasizing the need for further improvements to enhance reliability and accuracy.
- While these AI assistants demonstrate significant progress in language understanding and code generation, the study underscores the importance of ethical considerations and responsible usage.
- The research provides valuable insights into the rapidly evolving field of AI models for programming, offering a comparative analysis of different LLMs.
- Findings emphasize the need for continued refinement of AI technology and ethical development practices to fully realize the potential of these models in software development.
Source: Programming with AI: Evaluating ChatGPT, Gemini, AlphaCode, and GitHub Copilot for Programmers
Systems Engineering for LLM Integration in Socio-technical Systems
A paper exploring the application of systems engineering principles to address challenges in integrating LLMs into complex socio-technical systems for solving critical societal problems.
- The complexity of socio-technical systems and the nature of LLMs present significant challenges for their integration.
- Systems engineering approach is proposed as a more suitable method for LLM adoption, prioritizing problem context over technological aspects.
- The paper surveys existing systems research efforts in engineering AI-based systems and discusses how systems engineering principles have addressed similar issues in the past.
- Future directions for LLM adoption are provided based on the findings, emphasizing the importance of a holistic approach to integration.
Source: The Systems Engineering Approach in Times of Large Language Models
LLMs for Teaching Energy-Efficient Software Development
A proposal to develop an undergraduate learning module for energy-efficient software engineering, leveraging LLMs to overcome curriculum constraints.
- The increasing energy consumption of computing systems, particularly in data centers, highlights the need for energy-efficient software engineering techniques in undergraduate education.
- LLMs demonstrate potential as domain experts, capable of generating energy-efficient variations of basic linear algebra codes for ARM64 and AMD64 architectures, along with unit tests and energy measurement harnesses.
- Preliminary studies show energy expenditure reductions of 30-90% on toy examples suitable for classroom use.
- The vision includes developing LLM-based meta-compilers as tools for students to transform high-level algorithms into efficient, hardware-specific implementations.
- The learning module will incorporate systems thinking concepts, enabling students to reason about the local and global effects of energy optimizations.