[AI Dev Tools] LLM-Powered Code Review, Automated README Generation, GitHub Copilot Enhancement ...

README Generator CLI: Automated README Creation with LLMs

README Generator CLI is a command-line tool that automatically creates comprehensive README.md files for projects using an LLM, specifically Google's Gemini API.

Key Features:
  • Generates complete README files from project structure and content, leveraging Gemini's language processing capabilities.
  • Flexible usage options, including generating READMEs for the current directory or specified project locations.
  • Customizable output, allowing users to specify file names and locations for generated content.
  • Detailed logging and error handling to assist with troubleshooting and debugging.
  • Supports configuration through environment variables, enabling easy customization of API keys and default settings.
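The generation step itself is small: gather a view of the project, hand it to Gemini, and write the response to disk. The sketch below is an illustrative outline using the google-generativeai Python SDK, not the tool's actual code; the model name and prompt wording are assumptions.

```python
# Illustrative outline only -- not ReadMe-Generator's actual implementation.
# Assumes GOOGLE_API_KEY is set and the google-generativeai package is installed.
import os
from pathlib import Path

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")  # model choice is an assumption


def generate_readme(project_dir: str, output: str = "README.md") -> None:
    root = Path(project_dir)
    # A lightweight view of the project: its file tree as a plain listing.
    listing = "\n".join(
        str(p.relative_to(root)) for p in root.rglob("*") if p.is_file()
    )
    prompt = (
        "Write a comprehensive README.md for the following project.\n"
        f"File structure:\n{listing}\n"
        "Include overview, installation, usage, and license sections."
    )
    response = model.generate_content(prompt)
    (root / output).write_text(response.text, encoding="utf-8")


generate_readme(".")
```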
Source: https://github.com/usedispatch/ReadMe-Generator

Open Notebook: Privacy-Focused Alternative to Google's NotebookLM

Open Notebook is an open-source tool that empowers users to manage research, generate AI-assisted notes, and interact with content while maintaining privacy and control over their data.

Key Features:
  • Supports multiple notebooks for organized research across various topics.
  • Integrates with various content types including links, PDFs, EPUB, Office files, YouTube videos, audio, and video files.
  • Offers AI-assisted note generation and insights.
  • Includes built-in full-text and vector search for efficient information retrieval.
  • Provides fine-grained context management, allowing users to control what information is shared with the LLM.
  • Features a podcast generator to convert notes into audio format.
  • Implements a transformation system for extracting custom insights from content.
  • Supports multiple LLM providers including OpenAI, Anthropic, Gemini, Vertex AI, OpenRouter, and Ollama.
Source: https://github.com/lfnovo/open-notebook

Gemini AI Code Reviewer: Automated PR Reviews with Google's Gemini AI

A GitHub Action that automatically reviews pull requests using Google's Gemini AI, providing comments and suggestions to improve source code.

Key Features:
  • Integrates with GitHub Actions to review pull requests using the Gemini API.
  • Analyzes code changes in pull requests, excluding specified file types.
  • Generates and posts AI-powered review comments directly on GitHub PRs.
  • Supports customization of the Gemini model used for reviews.
  • Triggered by commenting "/gemini-review" on a pull request.
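Under the hood, the flow is fetch the diff, ask Gemini, post a comment. The Python sketch below shows that shape against the standard GitHub REST API; it is not the action's implementation, and the model name is an assumption.

```python
# Sketch of the review flow (fetch PR diff -> ask Gemini -> post a comment).
# Not the action's actual code; the GitHub endpoints are the standard REST API.
import os

import requests
import google.generativeai as genai

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
genai.configure(api_key=os.environ["GEMINI_API_KEY"])


def review_pr(repo: str, pr_number: int) -> None:
    headers = {"Authorization": f"Bearer {GITHUB_TOKEN}"}
    # Requesting the diff media type returns the pull request as a unified diff.
    diff = requests.get(
        f"https://api.github.com/repos/{repo}/pulls/{pr_number}",
        headers={**headers, "Accept": "application/vnd.github.v3.diff"},
    ).text
    model = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption
    review = model.generate_content(
        "Review this diff and suggest concrete improvements:\n" + diff
    ).text
    # Post the review as a regular pull request (issue) comment.
    requests.post(
        f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
        headers=headers,
        json={"body": review},
    )


review_pr("owner/repo", 1)
```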
Source: https://github.com/truongnh1992/gemini-ai-code-reviewer

llama.vim: Local LLM-assisted Text Completion for Vim

llama.vim is a Vim plugin that provides local LLM-assisted text completion, offering auto-suggestions and performance stats directly in the editor.

Key Features:
  • Auto-suggest on cursor movement in Insert mode, with manual toggle and easy acceptance options.
  • Supports large contexts even on low-end hardware through smart context reuse.
  • Configurable context scope, including chunks from open files, edited files, and yanked text.
  • Displays performance stats for generated suggestions.
  • Requires a running llama.cpp server instance, with recommended settings for different VRAM capacities.
  • Works with FIM (fill-in-the-middle) models, with a curated collection available on Hugging Face.
Source: https://github.com/ggml-org/llama.vim

@llamaindex/chat-ui: React Component Library for LLM Chat Interfaces

@llamaindex/chat-ui is a React component library providing pre-built UI elements for creating chat interfaces in LLM applications, streamlining the development process.

Key Features:
  • Pre-built chat components including message bubbles and input fields, with minimal styling for easy customization using Tailwind CSS.
  • Supports custom widgets for extending components, such as rendering generated or retrieved documents.
  • TypeScript support and seamless integration with LLM backends.
  • Code and LaTeX styling using highlight.js and KaTeX, with PDF viewing capabilities.
  • Composable components allowing for easy customization and extension with user-defined elements.
  • Utilizes the useChatUI hook for sending additional data to chat API endpoints, enhancing flexibility.
Source: https://github.com/run-llama/chat-ui

Jarvis: Command-Line Personal Assistant for Google Services

Jarvis is a command-line personal assistant that integrates with Google Calendar, Gmail, and Tasks to help manage your digital life.

Key Features:
  • Integrates with Gmail to view unread emails, Google Calendar to check upcoming events, and Google Tasks for task tracking.
  • Uses the OpenRouter.ai API for natural language processing, with support for a local LLM API as an alternative.
  • Provides a streamlined setup process with a start script and automatic token file creation for Google APIs.
  • Responds to user queries by polling relevant Google APIs based on message content, providing contextual information.
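Because OpenRouter exposes an OpenAI-compatible endpoint, the natural-language layer of such an assistant can be sketched with the standard openai client. The snippet below is a hypothetical illustration, not Jarvis's code; the model id and prompts are assumptions.

```python
# Hypothetical illustration of the natural-language layer, not Jarvis's code.
# OpenRouter speaks the OpenAI API, so the standard client works with a base_url swap.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)


def answer(query: str, calendar_events: list[str], unread_emails: list[str]) -> str:
    # Context pulled from the Google APIs is passed along with the user's question.
    context = (
        "Upcoming events:\n" + "\n".join(calendar_events)
        + "\nUnread emails:\n" + "\n".join(unread_emails)
    )
    resp = client.chat.completions.create(
        model="openai/gpt-4o-mini",  # any OpenRouter model id would do; assumption
        messages=[
            {"role": "system", "content": "You are a concise personal assistant."},
            {"role": "user", "content": context + "\n\nQuestion: " + query},
        ],
    )
    return resp.choices[0].message.content


print(answer("What's on today?", ["10:00 standup"], ["Invoice from ACME"]))
```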
Source: https://github.com/synth-mania/jarvis

Codai: AI-Powered Code Assistant for Developers

Codai is an AI code assistant that helps developers manage daily tasks through a session-based CLI, offering context-aware suggestions for code improvement and generation.

Key Features:
  • Uses RAG (Retrieval-Augmented Generation) to improve code suggestions by embedding and retrieving relevant information based on user input.
  • Summarizes the full project context using Tree-sitter, sending only code signatures, not full implementations, to the LLM.
  • Supports multiple LLMs, including GPT-4o and GPT-4, as well as local models served through Ollama, covering both cloud-based and local setups.
  • Provides intelligent code suggestions, assists with adding new features, refactoring, bug resolution, and code review.
  • Allows direct acceptance and application of code changes to the codebase.
  • Generates documentation automatically and supports multiple programming languages.
  • Offers customizable configuration through a config.yml file or environment variables.
  • Implements token management to track consumption for each request, aiding in API cost management.
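The "signatures only" idea is easy to picture. Codai does it with Tree-sitter across languages; as a rough single-language analogue, this sketch uses Python's standard ast module to reduce a file to its function and class signatures before they would be placed in an LLM prompt.

```python
# Rough analogue of sending only signatures, not implementations, to the LLM.
# Codai does this with Tree-sitter across many languages; this sketch handles
# Python files only, using the standard-library ast module.
import ast


def signatures(source: str) -> list[str]:
    sigs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            sigs.append(f"def {node.name}({args}): ...")
        elif isinstance(node, ast.ClassDef):
            sigs.append(f"class {node.name}: ...")
    return sigs


with open("app.py", encoding="utf-8") as f:  # any tracked source file
    print("\n".join(signatures(f.read())))   # compact context for the LLM prompt
```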
Source: https://github.com/meysamhadeli/codai

llm-jq: LLM-Assisted jq Program Generation and Execution

llm-jq is a plugin for LLM that allows users to write and execute jq programs with the assistance of large language models.

Key Features:
  • Generates and executes jq programs based on natural language descriptions of desired JSON transformations.
  • Integrates seamlessly with the LLM command-line tool, allowing for easy installation and use.
  • Offers options for silent operation, outputting only the jq program, verbose mode, and customizing the input length for the model.
  • Supports piping JSON data directly into the command, making it convenient for processing API responses or other JSON sources.
  • Example use: curl -s https://api.github.com/repos/simonw/datasette/issues | llm jq 'count by user.login, top 3' generates a jq program to count and rank GitHub issue creators.
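Conceptually, the plugin turns a natural-language description into a jq program and then runs it over the piped-in JSON. The sketch below shows that shape in Python; it is not the plugin's source, and the model choice is an assumption.

```python
# Concept sketch of the flow, not the plugin's source: turn a natural-language
# description into a jq program, then run it over the piped-in JSON with the
# system's jq binary. Model choice below is an assumption.
import subprocess
import sys

from openai import OpenAI


def nl_to_jq(description: str) -> None:
    data = sys.stdin.read()  # JSON piped in, e.g. from a curl command
    client = OpenAI()
    program = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Reply with only a jq program (no prose) that does this: "
                       + description + "\nSample of the input:\n" + data[:2000],
        }],
    ).choices[0].message.content.strip()
    print(program, file=sys.stderr)  # echo the program, like a verbose mode
    # Execute the generated program against the full input with the real jq binary.
    result = subprocess.run(["jq", program], input=data, capture_output=True, text=True)
    print(result.stdout, end="")


if __name__ == "__main__":
    nl_to_jq(" ".join(sys.argv[1:]))
```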
Source: https://github.com/simonw/llm-jq

ask.py: AI-Powered Web Search and Summarization Tool

ask.py is a Python program that implements a search-extract-summarize workflow, similar to AI search engines like Perplexity, allowing users to perform web searches and generate summarized answers.

Key Features:
  • Performs Google searches, scrapes web pages, and uses vector search to find relevant content chunks.
  • Utilizes LLMs to generate answers based on the retrieved information.
  • Offers both command-line and Gradio web UI interfaces for flexible usage.
  • Allows customization of search behavior, including date restrictions and site-specific searches.
  • Supports various output modes, including answer generation and structured data extraction.
  • Can process a predefined list of URLs instead of performing web searches.
  • Integrates with DuckDB for vector and full-text search capabilities.
  • Easily deployable to Hugging Face Spaces for sharing and collaboration.
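The retrieval step at the heart of the workflow can be pictured in a few lines: embed the scraped chunks, rank them against the query, and answer from the top matches. The sketch below is an in-memory analogue (ask.py itself backs this with DuckDB), and the OpenAI model names are assumptions.

```python
# In-memory analogue of the retrieve-then-answer step; ask.py itself uses DuckDB
# for vector and full-text search. Model names are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])


def answer(question: str, chunks: list[str], top_k: int = 5) -> str:
    chunk_vecs, query_vec = embed(chunks), embed([question])[0]
    # Cosine similarity between the question and every scraped chunk.
    scores = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    context = "\n---\n".join(chunks[i] for i in np.argsort(scores)[::-1][:top_k])
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n{context}\n\nQ: {question}",
        }],
    )
    return resp.choices[0].message.content
```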
Source: https://github.com/pengfeng/ask.py

askrepo: Source Code Analysis with LLMs

askrepo is a tool that analyzes Git-managed source code using Google's Gemini API, providing answers to user-specified questions about the codebase.

Key Features:
  • Reads content from Git-tracked text files in a specified directory, filtering out binary files.
  • Sends formatted code content to the Google Gemini API along with a user-defined prompt.
  • Provides flexible command-line options for customizing the analysis, including choice of Gemini model and custom prompts.
  • Integrates with Git to focus on relevant, tracked files within a project.
  • Example use: askrepo can be used to explain code functionality, find potential bugs, or answer specific questions about a codebase. For instance: askrepo --prompt "What is the purpose of this code?" ../your-repo/src
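A Python analogue of the gathering step: list Git-tracked files, skip anything that is not readable text, and concatenate the rest into a single prompt. This is illustrative only, not askrepo's actual code.

```python
# Illustrative Python analogue of the gathering step, not askrepo's actual code:
# list Git-tracked files, skip non-text files, and build one prompt string.
import subprocess
from pathlib import Path


def gather(repo_dir: str) -> str:
    files = subprocess.run(
        ["git", "ls-files"], cwd=repo_dir, capture_output=True, text=True, check=True
    ).stdout.splitlines()
    parts = []
    for name in files:
        try:
            text = (Path(repo_dir) / name).read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable files are skipped
        parts.append(f"### {name}\n{text}")
    return "\n\n".join(parts)


prompt = "What is the purpose of this code?\n\n" + gather("../your-repo/src")
```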
Source: https://github.com/laiso/askrepo

Flexpilot AI: Open-Source AI Assistant for VS Code

Flexpilot AI is an open-source VS Code extension that provides flexible AI-powered development assistance, allowing users to integrate their preferred AI providers and models directly into their coding environment.

Key Features:
  • Native VS Code integration for a seamless coding experience without webviews
  • Supports multiple AI providers, including OpenAI, Anthropic, Google Gemini, and others, using the user's own API keys
  • AI-powered code completions offer context-aware suggestions and natural language guidance
  • Panel Chat and Inline Chat features for interactive AI conversations within the workspace
  • Quick Chat provides instant answers with a single shortcut, maintaining workflow continuity
  • Smart Variables feature references code elements for more tailored assistance
  • Voice Chat enables hands-free coding with spoken questions and real-time code suggestions
  • Generates AI-powered commit messages and PR descriptions for clearer code contributions
  • Token Usage Insights provide transparency on AI interaction consumption
  • Open-source under GNU GPLv3 license, encouraging community contributions and customization
Source: https://github.com/flexpilot-ai/vscode-extension

Lumen: AI-Powered Git Commit and Diff Summarization Tool

Lumen is a command-line tool that utilizes AI to generate commit messages, summarize git diffs, and provide insights into code changes without requiring an API key.

Key Features:
  • Generates commit messages for staged changes and summaries for specific commits or diffs.
  • Allows users to ask questions about specific changes, enhancing understanding of code modifications.
  • Offers fuzzy-search functionality for commit summary generation using fzf.
  • Operates without an API key, providing free and unlimited usage out of the box.
  • Supports multiple AI providers, including Phind, OpenAI, Groq, Claude, Ollama, and OpenRouter.
  • Provides pretty output formatting using Markdown and mdcat.
  • Allows customization through a configuration file, including commit types and provider settings.
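The core operation, turning a staged diff into a commit message, looks roughly like the sketch below. Lumen itself works provider-free out of the box; the OpenAI client and model here are assumptions used purely to illustrate the step.

```python
# Sketch of the staged-diff-to-commit-message step only. Lumen itself needs no
# API key and supports several providers; the OpenAI client and model here are
# assumptions used purely for illustration.
import subprocess

from openai import OpenAI


def draft_commit_message() -> str:
    diff = subprocess.run(
        ["git", "diff", "--staged"], capture_output=True, text=True, check=True
    ).stdout
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Write a conventional commit message for this diff:\n" + diff,
        }],
    )
    return resp.choices[0].message.content


print(draft_commit_message())
```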
Source: https://github.com/jnsahaj/lumen

Promptim: Automated Prompt Optimization for AI Systems

Promptim is an experimental library that automates the process of improving prompts for AI tasks through systematic optimization.

Key Features:
  • Automates prompt improvement by running optimization loops on specific tasks using provided datasets and custom evaluators.
  • Supports human-in-the-loop feedback through an annotation queue interface for guided optimization.
  • Utilizes a metaprompt approach to suggest incremental changes to the current prompt based on performance metrics.
  • Provides a CLI for easy task creation, configuration, and training initiation.
  • Allows customization of the optimization process through configuration files and command-line arguments.
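The optimization loop can be summarized as: score the current prompt on a dataset, ask a metaprompt for a small revision, and keep the revision only if the score improves. The sketch below mirrors that idea in plain Python; the dataset format, evaluator signature, and model are assumptions, and Promptim's real loop is driven by its CLI and task configuration.

```python
# Conceptual sketch of the optimization loop; Promptim's real loop runs through
# its CLI and task configuration. Dataset format, evaluator signature, and model
# are assumptions made for illustration.
from openai import OpenAI

client = OpenAI()


def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content


def optimize(prompt: str, dataset: list[dict], evaluator, steps: int = 5) -> str:
    # evaluator(output, example) -> float; examples fill the prompt's placeholders.
    best = sum(evaluator(llm(prompt.format(**ex)), ex) for ex in dataset)
    for _ in range(steps):
        candidate = llm(
            "Suggest one small, concrete improvement to this prompt and return "
            f"only the revised prompt.\nCurrent prompt:\n{prompt}\nCurrent score: {best}"
        )
        score = sum(evaluator(llm(candidate.format(**ex)), ex) for ex in dataset)
        if score > best:  # keep only changes that improve the metric
            prompt, best = candidate, score
    return prompt
```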
Source: https://github.com/hinthornw/promptimizer

Ditto: Simple Self-Building Coding Agent for Flask Applications

Ditto is a tool that generates multi-file Flask applications from natural language descriptions using a no-code interface and LLM technology.

Key Features:
  • Generates Flask applications from simple natural language input, automating the coding process for routes, templates, and static files.
  • Self-building agent plans and constructs the application without manual coding, organizing code into a clean, modular structure.
  • Web interface allows users to describe their desired application and monitor the generation progress in real-time.
  • Requires Python 3.7+ and an OpenAI API key for operation. Installation involves cloning the repository and installing dependencies.
  • Example use: Users can create a Flask application by describing it in plain English through the web interface at http://localhost:8080.
Source: https://github.com/yoheinakajima/ditto

Integuru: AI-Powered Integration Code Generator

Integuru is an AI agent that automates the creation of integration code by reverse-engineering internal APIs of various platforms.

Key Features:
  • Generates a dependency graph of API requests to perform desired actions on external platforms.
  • Creates runnable Python code that interacts with internal platform endpoints.
  • Supports input variables for flexible graph generation (code generation support coming soon).
  • Utilizes OpenAI's GPT-4o for graph generation and o1-preview for code generation.
  • Handles complex authentication processes, including two-factor authentication (2FA).
Source: https://github.com/Integuru-AI/Integuru

CoqPilot: VS Code Extension for Automated Coq Proof Generation

CoqPilot is a VS Code extension that automates the writing of Coq proofs by combining LLMs and non-machine-learning methods to generate and validate proof candidates for incomplete proofs.

  • The plugin identifies proof holes marked with the "admit" tactic in Coq files and generates proof candidates to fill these gaps.
  • CoqPilot checks the validity of each generated proof candidate and replaces successful ones in the original file.
  • A key feature is its zero-setup experience, allowing users to seamlessly combine multiple Coq generation approaches.
  • The tool also serves as a platform for LLM-based experiments on Coq proof generation, including a built-in benchmarking system for various generation methods.

Source: CoqPilot, a plugin for LLM-based generation of proofs

SmartGSN: AI-Powered Assurance Case Management Tool

SmartGSN is an online tool that uses LLMs to (semi-)automate the management of assurance cases that comply with the Goal Structuring Notation (GSN), helping producers of mission-critical systems demonstrate compliance with industry standards.

  • Assurance cases are crucial for demonstrating compliance with industry standards to regulatory authorities, helping prevent system failures in mission-critical systems.
  • The tool utilizes LLMs to detect assurance case patterns within manually created cases across various application domains.
  • Evaluation of SmartGSN shows strong capability in identifying patterns in assurance cases from five different systems.
  • SmartGSN is accessible online at https://smartgsn.vercel.app, with a demonstration video available at https://youtu.be/qLrTHf-SZbM.

Source: SmartGSN: a generative AI-powered online tool for the management of assurance cases

CLDK: Program Analysis Framework for Code LLMs

CLDK (codellm-devkit) is an open-source Python library that simplifies program analysis for different programming languages to support code LLM use cases.

  • The framework addresses the challenge of providing code-specific contextual information to LLMs, which is typically derived from program analysis tools.
  • CLDK offers an intuitive interface for performing program analysis at various levels of granularity across different programming languages.
  • It enables developers to easily integrate detailed code insights, enhancing the efficiency and effectiveness of LLMs in coding tasks.
  • The library aims to bridge the gap between language-specific static analysis tools and their integration with code LLMs.
  • CLDK is available on GitHub, providing an accessible solution for developers looking to leverage program analysis in conjunction with code LLMs.

Source: Codellm-Devkit: A Framework for Contextualizing Code LLMs with Program Analysis Insights

GIS Copilot: Integrating LLMs into GIS Platforms for Autonomous Spatial Analysis

GIS Copilot is a framework that integrates LLMs into existing GIS platforms, enabling users to perform spatial analysis using natural language commands.

  • The framework leverages LLMs' reasoning and programming capabilities to generate spatial analysis workflows and code autonomously.
  • Implementation focused on QGIS, creating a "GIS Copilot" that allows users to interact with the platform using natural language.
  • Evaluation involved over 100 spatial analysis tasks of varying complexity, from basic operations to advanced multi-step processes.
  • Results showed high success rates in tool selection and code generation for basic and intermediate tasks, with challenges remaining for more complex operations.
  • The study contributes to the vision of Autonomous GIS, potentially simplifying workflows and enabling non-experts to engage in geospatial analysis with minimal prior knowledge.

Source: GIS Copilot: Towards an Autonomous GIS Agent for Spatial Analysis

PyGen: Automated Python Package Creation from User Prompts

PyGen is an automation platform that generates Python packages from user-provided abstract ideas, leveraging large language models to streamline the software development process.

  • The platform combines autoregressive LLMs with open-source code generation technologies to create complete Python packages, including documentation.
  • PyGen employs a prompt enhancement approach, refining user descriptions into specific, actionable instructions for package creation.
  • Evaluation of generated packages and documentation utilized Human Evaluation, LLM-based assessment, and CodeBLEU metrics.
  • The tool aims to enhance researcher productivity by enabling the creation of resilient, modular, and well-documented packages for specialized purposes.
  • PyGen's code and generated examples are available on GitHub, promoting collaborative development and accessibility in scientific and technological advancement.

Source: PyGen: A Collaborative Human-AI Approach to Python Package Creation

CodeScribe: AI-Assisted Code Translation and Development for Scientific Computing

CodeScribe is a tool that combines prompt engineering with user supervision to facilitate efficient code conversion and development in scientific computing workflows.

  • The tool assists in converting Fortran code to C++, addressing the need for modernizing legacy codebases in scientific computing.
  • CodeScribe generates Fortran-C APIs, enabling integration of legacy systems with modern C++ libraries.
  • It provides developer support for code organization and algorithm implementation, enhancing productivity in scientific workflows.
  • The tool was developed and tested on a legacy Fortran codebase used for simulating particle interactions at the Large Hadron Collider.
  • While leveraging generative AI for code translation and development, CodeScribe acknowledges the need for manual intervention and incorporates user supervision to ensure correctness.

Source: Leveraging Large Language Models for Code Translation and Software Development in Scientific Computing

AlphaTrans: Repository-Level Code Translation and Validation

AlphaTrans is a neuro-symbolic approach for automating repository-level code translation, addressing the challenges of translating complex, real-world projects with dependencies and language-specific features.

  • The system translates both source and test code, employing multiple levels of validation to ensure functionality preservation.
  • Program analysis is used to decompose programs into fragments, which are then translated in reverse call order to manage complexity for LLMs.
  • In tests on ten open-source projects, AlphaTrans translated 6899 code fragments with 99.1% syntactic correctness and validated runtime behavior and functional correctness for 25.8% of them.
  • The translation and validation process averaged 36 hours per project, demonstrating practical scalability.
  • For incorrect translations, AlphaTrans generates detailed reports, enabling developers to efficiently fix translation bugs and achieve passing tests.

Source: Repository-Level Compositional Code Translation and Validation

Improving ChatGPT-Based Abbreviation Expansion in Source Code

A study on using ChatGPT for expanding abbreviations in source code identifiers, addressing initial performance issues and proposing improvements.

  • Initial evaluation showed ChatGPT significantly underperformed compared to state-of-the-art methods, with 28.2% lower precision and 27.8% lower recall on a public benchmark.
  • Root causes for poor performance: lack of context and inability to recognize abbreviations.
  • Improvements included providing surrounding source code as context and implementing an iterative approach to identify and mark missed abbreviations in prompts.
  • A post-condition check was added to exclude expansions violating common sense.
  • These enhancements brought ChatGPT-based abbreviation expansion to a level comparable with state-of-the-art approaches, without requiring expensive source code parsing and deep analysis.

Source: Evaluating and Improving ChatGPT-Based Expansion of Abbreviations

Automated Flaky Test Detection for Quantum Software

A framework for automating the detection of flaky tests in quantum software, addressing the challenges posed by their complexity and probabilistic nature.

  • The framework expands on prior manual analysis of 14 quantum software repositories, using transformers and cosine similarity for automated detection.
  • Experiments with LLMs from OpenAI GPT and Meta LLaMA families assessed their ability to detect and classify flaky tests from code and issue descriptions.
  • Results showed promising outcomes: 25 new flaky tests were identified, expanding the dataset by 54%. Top LLMs achieved an F1-score of 0.8871 for flakiness detection.
  • Root cause identification remains a challenge, with LLMs achieving only a 0.5839 F1-score for this task.
  • Future work will focus on improving detection techniques and developing automatic fixes for flaky tests in large quantum codebases.

Source: Automating Quantum Software Maintenance: Flakiness Detection and Root Cause Analysis

Multi-Agent AI System Improves Software Development Tools

An experiment demonstrates enhanced performance when integrating Crowdbotics PRD AI and GitHub Copilot in a multi-agent configuration for software development.

  • The study focuses on context sharing between two commercial AI tools: Crowdbotics PRD AI for generating software requirements and GitHub Copilot for AI pair-programming.
  • By sharing business requirements from PRD AI with GitHub Copilot, the code suggestion capabilities improved by 13.8%.
  • Developer task success rate increased by 24.5% when using the integrated multi-agent system.
  • This real-world application demonstrates the potential of commercially available AI tools working together to enhance the software development lifecycle.

Source: Improving Performance of Commercially Available AI Products in a Multi-Agent Configuration

LLM-Based JSON Parser Fuzzing: Bug Discovery and Behavioral Analysis

A research project utilizing Large Language Models (LLMs) to enhance JSON parser testing, focusing on generating test cases and mutants for bug discovery and behavioral analysis.

  • The project aims to uncover potential bugs in open-source JSON parsers through LLM-generated test cases and mutants.
  • Identification of behavioral diversities among different JSON parsers is a key objective of the research.
  • This approach builds on the success of fuzzing techniques in uncovering bugs and vulnerabilities across various software systems.
  • The research emphasizes the importance of ensuring reliability in JSON parsers, which play a crucial role in modern software development.
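The differential-testing idea behind the approach is straightforward: feed the same (LLM-generated or mutated) inputs to multiple parsers and flag any disagreement in result or failure mode. A minimal sketch, assuming the third-party simplejson package is installed alongside the standard-library json module:

```python
# Minimal differential-testing harness: run the same inputs through two JSON
# parsers and report any disagreement in result or failure mode. The parsers and
# sample mutants are assumptions for illustration; simplejson is a third-party package.
import json

import simplejson


def outcome(parser, text):
    try:
        return ("ok", parser.loads(text))
    except Exception as exc:  # capture the failure mode, not just success/failure
        return ("error", type(exc).__name__)


# In the paper's setting these inputs would come from LLM-generated tests and mutants.
mutants = ['{"a": 1}', '{"a": 01}', '[1, 2,]', '{"a": NaN}', '"\\ud800"']
for m in mutants:
    left, right = outcome(json, m), outcome(simplejson, m)
    if left != right:
        print(f"divergence on {m!r}: json={left} simplejson={right}")
```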

Source: Large Language Models Based JSON Parser Fuzzing for Bug Discovery and Behavioral Analysis

M2RC-EVAL: Multilingual Repository-level Code Completion Benchmark

A benchmark for evaluating code LLMs' multilingual repository-level code completion abilities, covering 18 programming languages.

  • M2RC-EVAL addresses limitations of existing benchmarks, which typically focus on fewer than 5 languages and report only average scores.
  • The benchmark includes fine-grained annotations at bucket and semantic levels, based on parsed abstract syntax trees, to assess performance in various completion scenarios.
  • Accompanying the benchmark is M2RC-INSTRUCT, a multilingual instruction dataset designed to enhance repository-level code completion capabilities of existing code LLMs.
  • Experimental results demonstrate the effectiveness of both M2RC-EVAL and M2RC-INSTRUCT in improving and evaluating multilingual code completion.

Source: M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation

LLM-Generated Test Oracles: Actual vs Expected Behavior Study

A study investigating whether test oracles generated by Large Language Models (LLMs) capture actual or expected program behavior in software testing.

  • The research compared LLM-generated test oracles with developer-written and automatically generated test cases for 24 open-source Java repositories.
  • LLM-based test generation approaches tend to produce oracles that capture actual program behavior rather than expected behavior, similar to traditional test generation techniques like Randoop and Evosuite.
  • LLMs perform better at generating test oracles than at classifying whether a given oracle is correct. Their performance improves when code contains meaningful test or variable names.
  • Despite limitations, LLM-generated test oracles demonstrate higher fault detection potential compared to those generated by Evosuite.

Source: Do LLMs generate test oracles that capture the actual or the expected program behaviour?

Challenges and Roadmap for Production-Ready Foundation Model Software

A comprehensive analysis of the challenges in transitioning foundation model (FM) software, including large language models (LLMs), from demonstrations to production-ready systems.

  • FMware, software systems integrating FMs as core components, faces significant obstacles in production environments. These include reliability issues, high implementation costs, scalability challenges, and privacy regulation compliance.
  • Key challenges identified in the FMware lifecycle: FM selection, data and model alignment, prompt engineering, agent orchestration, system testing, and deployment. Cross-cutting concerns include memory management, observability, and feedback integration.
  • The analysis draws from industry experience, diverse data sources, and involvement in the Open Platform for Enterprise AI (OPEA) and FMware lifecycle engineering.
  • The paper discusses necessary technologies and strategies to address these challenges, offering guidance for developing scalable, production-ready FMware solutions.
  • Findings emphasize the importance of continued research and multi-industry collaboration to advance production-ready FMware development.

Source: From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap

APIs for Social Work Research: Leveraging LLMs and AI Services

A guide on using Application Programming Interfaces (APIs) to integrate Large Language Models (LLMs) and other AI services into social work research methodologies.

  • APIs serve as essential tools for researchers to access advanced technologies like LLMs and AI services, enhancing research capabilities.
  • The paper provides an overview of API functionality and integration into research workflows, addressing common barriers for those without programming experience.
  • Practical code examples demonstrate how LLMs can generate API code for accessing specialized services, such as extracting data from unstructured text.
  • The guide emphasizes data security, privacy considerations, and ethical concerns when using APIs in research contexts.
  • By equipping researchers with these tools and knowledge, the paper aims to expand the impact of social work research through effective incorporation of AI technologies.

Source: Demystifying Application Programming Interfaces (APIs): Unlocking the Power of Large Language Models and Other Web-based AI Services in Social Work Research

LLM-Enhanced Agile Model Driven Development for Code Generation

A research approach integrating LLMs like GPT-4 into Agile Model Driven Development (AMDD) for enhanced code auto-generation, addressing challenges in natural language ambiguity and improving flexibility in software development.

  • The study proposes AMDD as a solution to overcome challenges in using LLMs for code generation, particularly addressing ambiguities in natural language descriptions of software.
  • Researchers modeled a multi-agent Unmanned Vehicle Fleet system using UML, integrating Object Constraint Language (OCL) for code structure meta-modeling and FIPA ontology language for communication semantics meta-modeling.
  • GPT-4's auto-generation capabilities were applied to produce Java and Python code compatible with JADE and PADE frameworks, respectively. The generated code aligned with expected behaviors and showed improvements in agent interactions.
  • Evaluation of code complexity revealed that ontology-constrained meta-models produced more complex code, but cyclomatic complexity remained within manageable levels. This suggests the potential for incorporating additional meta-model constraints without exceeding high-risk complexity thresholds.

Source: LLM as a code generator in Agile Model Driven Development

MAD: LLM-Powered Smart Contract Decompiler for Sui Blockchain

MAD (Move AI Decompiler) is a web application that decompiles Sui blockchain smart contract bytecodes into human-readable, logically correct, and re-compilable source code using LLMs.

  • The tool addresses the challenge of auditing non-open-source smart contracts on platforms like Sui, where source code is often unavailable.
  • MAD's output successfully passes original unit tests and achieves a 66.7% recompilation success rate on real-world smart contracts.
  • In a user study with 12 developers, MAD significantly reduced auditing workload compared to traditional decompilers, with outputs comparable to original source code.
  • Despite occasional hallucinations and compile errors, MAD provides substantial improvements over traditional decompilers in smart contract logic comprehension and auditing.
  • The application has potential implications for enhancing blockchain smart contract transparency, auditing, and education, with possible extensions to other smart contract languages like Solidity.

Source: MAD: Move AI Decompiler to Improve Transparency and Auditability on Non-Open-Source Blockchain Smart Contract

Collaborative AI Framework for Sentiment Analysis

A framework designed to efficiently distribute and resolve sentiment analysis tasks across various AI systems, addressing challenges in processing complex multimodal data and high-cost feature extraction.

  • The framework leverages generative AI models like ChatGPT and Google Gemini to simplify complex sentiment analysis tasks into manageable, phased objectives.
  • Developed to meet marketing-oriented software needs, the system integrates diverse AI models for processing multimodal data.
  • A case study demonstrates the framework's effectiveness in analyzing sentiments across various online media channels using edge and cloud computing.
  • This approach enables practical, widespread applications of sentiment analysis in industry settings, moving beyond specialized research environments.

Source: Collaborative AI in Sentiment Analysis: System Architecture, Data Prediction and Deployment Strategies

MAGDA: LLM-Assisted Domain Modeling Tool

MAGDA is a user-friendly tool that leverages large language models (LLMs) and few-shot prompt learning to assist in domain modeling for model-driven engineering (MDE).

  • The tool aims to overcome challenges in MDE, such as time constraints, incomplete domain understanding, and adherence to syntactic constraints.
  • MAGDA's approach eliminates the need for extensive training on scarce domain-specific datasets, offering versatile support for various modeling activities.
  • A user study was conducted to assess the real-world applicability of MAGDA in domain modeling, providing insights into its usability and effectiveness.
  • The tool offers valuable recommendations to software modelers, potentially simplifying the software development process through abstraction.

Source: On the Utility of Domain Modeling Assistance with Large Language Models

Evaluating Software Development Agents: A Study of AI-Generated Patches on GitHub Issues

A comprehensive study evaluating 4,892 patches from 10 top-ranked software development agents on 500 real-world GitHub issues, focusing on their impact on code quality.

  • No single agent dominated the evaluation, with 170 issues remaining unresolved, indicating room for improvement in AI-driven software development.
  • Agent-generated patches that passed unit tests and resolved issues often differed from gold patches created by repository developers, revealing limitations in benchmark test case coverage.
  • Most agents maintained code reliability and security, avoiding new bugs or vulnerabilities. Some increased code complexity, while many reduced code duplication and minimized code smells.
  • Agents performed better on simpler codebases, suggesting that breaking complex tasks into smaller sub-tasks could improve their effectiveness.
  • The study provides valuable insights for advancing AI-driven software development, particularly in the context of real-world GitHub scenarios.

Source: Evaluating Software Development Agents: Patch Patterns, Code Quality, and Issue Complexity in Real-World GitHub Scenarios

LLM Flaw Reporting Best Practices: Lessons from OLMo Bug Bounty

A study of a bug bounty program for the Open Language Model (OLMo) reveals best practices for reporting flaws in LLMs, aimed at improving safety and reducing potential incidents.

  • 495 hackers participated in an open-ended bug bounty for OLMo, conducted by The Allen Institute for AI in August 2024.
  • A vendor panel from OLMo's safety program evaluated submissions, awarding cash bounties for demonstrations necessitating public disclosure updates.
  • The study identifies best practices for safety reporting processes, artifacts, and safety program staffing in LLM development.
  • Lessons learned focus on improving documentation clarity regarding model intent, capabilities, and potential hazards.

Source: To Err is AI : A Case Study Informing LLM Flaw Reporting Practices

LLM Integration in Software: Emerging Solutions for Quality Assurance

A study exploring solutions adopted by software developers to address challenges in integrating LLMs into products, based on interviews and surveys at Microsoft.

  • The research utilized a mixed-method approach, including 26 interviews and a survey with 332 responses from Microsoft product teams.
  • Unique characteristics of LLMs challenge traditional software development and evaluation assumptions, pushing developers out of their comfort zones.
  • The study identified 19 emerging solutions focused on quality assurance for LLM-based products.
  • Findings offer insights to guide the development and evaluation of software incorporating LLMs across various industries.

Source: Beyond the Comfort Zone: Emerging Solutions to Overcome Challenges in Integrating LLMs into Software Products

JSONTestGen: LLM-Driven Unit Test Generation for JSON Libraries

JSONTestGen is an approach that uses large language models (LLMs) to generate unit tests for JSON libraries, specifically targeting fastjson2, an open-source library from Alibaba.

  • The method leverages LLMs' programming capabilities to create diverse test cases based on historical bug-triggering unit tests and JSON-specific mutation rules.
  • Differential testing is employed to systematically identify potential bugs in the generated unit tests.
  • JSONTestGen outperformed existing test generation tools in detecting unknown defects, uncovering 34 real bugs in fastjson2, with 30 already fixed.
  • While LLM-generated tests can contain errors, particularly in assertions, the research suggests LLMs have potential for classifying false-positive test failures.
  • The approach demonstrates promise for improving test oracle automation in JSON libraries, addressing critical issues like data inconsistencies and security vulnerabilities.

Source: Advancing Bug Detection in Fastjson2 with Large Language Models Driven Unit Test Generation

Adversarial Perturbations to Prevent LLM-assisted Cheating in Programming Courses

A study exploring methods to impede LLM-assisted cheating in introductory programming assignments through adversarial perturbations.

  • The research investigated the performance of five widely used LLMs on introductory programming problems.
  • Adversarial perturbations were examined as a means to degrade LLM performance in code generation.
  • A user study revealed that combined perturbations reduced the average correctness score of LLM-generated code by 77%.
  • The effectiveness of perturbations in reducing code correctness varied based on their detectability.

Source: Impeding LLM-assisted Cheating in Introductory Programming Assignments via Adversarial Perturbation

AutoRestTest: Multi-Agent Framework for REST API Testing

AutoRestTest is a black-box framework for REST API testing that employs a multi-agent approach, integrating Multi-Agent Reinforcement Learning (MARL), Semantic Property Dependency Graphs (SPDG), and Large Language Models (LLMs).

  • The framework addresses limitations of existing black-box REST API testing tools by focusing on dependencies between test elements rather than treating them in isolation.
  • Four collaborative agents (API, dependency, parameter, and value) optimize API exploration, with LLMs handling domain-specific value restrictions and SPDG simplifying the search space for dependencies.
  • Evaluated on 12 real-world REST services, AutoRestTest outperformed leading black-box REST API testing tools in code coverage, operation coverage, and fault detection.
  • The framework was the only tool able to identify an internal server error in Spotify, demonstrating its effectiveness in real-world scenarios.
  • An ablation study confirmed the significant contributions of agent learning, SPDG, and LLM components to the framework's performance.

Source: A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs

RevMate: LLM-based Review Comment Generation in Open and Closed-Source Environments

A large-scale empirical user study evaluating the acceptance and impact of LLM-generated comments in code review processes at Mozilla and Ubisoft.

  • RevMate, an LLM-based assistive tool, was integrated into the usual review environments of both organizations. It uses Retrieval Augmented Generation and LLM-as-a-Judge to generate and filter review comments.
  • Acceptance rates for LLM-generated comments were 8.1% at Mozilla and 7.2% at Ubisoft. An additional 14.6% and 20.5% of comments, respectively, were marked as valuable review or development tips.
  • Refactoring-related comments had higher acceptance rates (18.2% and 18.6%) compared to functional comments (4.8% and 5.2%).
  • The median time spent by reviewers to inspect and edit generated comments was 43 seconds per patch, considered reasonable.
  • Accepted LLM-generated comments were as likely to lead to future patch revisions as human-written comments (74% vs 73% at chunk-level).

Source: Impact of LLM-based Review Comment Generation in Practice: A Mixed Open-/Closed-source User Study

Time Curves: A Visual Approach to Analyzing Large-Scale Software System Logs

A pipeline combining clustering, LLM summarization, and Time Curves visualization to efficiently analyze and summarize vast collections of software system logs.

  • The method addresses the challenges of analyzing large-scale software system logs, which are crucial for insights into system health, performance, and security but difficult to process manually due to their volume and variability.
  • Clustering algorithms and LLMs, while useful, have limitations in processing large log collections efficiently or maintaining temporal and relational context.
  • The proposed approach uses a semimetric distance to measure similarity between events, enabling meaningful representation through Time Curves visualization.
  • This technique can explain main events in logs from different applications without prior knowledge, detect general trends, and identify outliers in parallel and distributed systems.
  • The pipeline is expected to significantly reduce the time required for analyzing and resolving system-wide issues, identifying performance bottlenecks, and debugging applications.

Source: Analyzing Logs of Large-Scale Software Systems using Time Curves Visualization

LLMs in Automated Software Refactoring: An Empirical Study

A study investigating the potential of LLMs in automated software refactoring, focusing on identifying refactoring opportunities and recommending solutions.

  • The research used a dataset of 180 real-world refactorings from 20 projects to evaluate ChatGPT and Gemini's performance.
  • Initially, ChatGPT and Gemini identified only 28 and 7 refactoring opportunities, respectively. However, providing specific prompts increased ChatGPT's success rate from 15.6% to 86.7%.
  • ChatGPT recommended solutions for 176 refactorings, with 63.6% comparable to or better than human expert solutions. However, some suggestions were unsafe, changing functionality or introducing syntax errors.
  • To mitigate risks, the study proposes RefactoringMirror, a detect-and-reapply tactic using tested refactoring engines to validate LLM recommendations.

Source: An Empirical Study on the Potential of LLMs in Automated Software Refactoring

GUPPY: Automated Update of Android Deprecated API Usages with LLMs

GUPPY is an automated approach that leverages LLMs to update deprecated API usages in Android applications, ensuring compatibility across API levels.

  • The system employs GPT-4 with carefully crafted prompts to update deprecated API usages, addressing the challenge of lingering outdated APIs in Android apps.
  • GUPPY generates tests, identifies incorrect updates, and refines API usage through an iterative process until tests pass or a specified limit is reached.
  • Evaluation on 360 benchmark API usages from 20 deprecated APIs and 156 deprecated API usages from API levels 33 and 34 demonstrates GUPPY's effectiveness over existing techniques.

Source: Automated Update of Android Deprecated API Usages with Large Language Models

AI-Generated Code Detection: Challenges and Improvements

A study examining the effectiveness of current AI-generated code detection tools and proposing new approaches to improve detection accuracy.

  • Existing AI detection tools perform poorly in identifying AI-generated source code, lacking generalizability for practical deployment.
  • The study proposes several methods to enhance detection performance, including fine-tuning LLMs and using machine learning-based classification with static code metrics or AST-generated code embedding.
  • The best-performing model developed in this study outperforms the current state-of-the-art detector (GPTSniffer), achieving an F1 score of 82.55.
  • An ablation study was conducted to assess the impact of different source code features on the model's performance.

Source: An Empirical Study on Automatically Detecting AI-Generated Source Code: How Far Are We?

AI Assistants in Test Development: Impact and Evaluation

A study analyzing the impact of AI coding assistants on software test development, focusing on their ability to generate unit tests for open-source modules.

  • Recent advancements in LLMs and AI-assisted coding tools have transformed software development, enabling the generation of complete programs with minimal human intervention.
  • The research evaluates three popular AI coding assistants using the Test Pyramid concept, which categorizes tests into unit, integration, and end-to-end tests.
  • Findings indicate that AI-generated tests are of comparable quality to original tests, with notable differences in usage and results among the tools.
  • While AI assistants show promise in automated testing, thorough review and testing by developers remain crucial before deployment.

Source: Disrupting Test Development with AI Assistants

LLM-Based Test Oracle Generation from Javadocs

A framework for automating test oracle generation for Java library clients using LLMs and Javadocs, addressing the challenge of creating effective software tests.

  • Test oracles, critical for validating code outputs, have been less explored in automation compared to test input generation.
  • The approach leverages Javadocs of core Java libraries as a rich source of information for generating test oracles.
  • LLMs are employed to interpret Javadocs and create executable test oracles for both normal and exceptional behaviors.
  • Experimental results show high success rates: 98.8% of generated oracles are compilable, and 96.4% accurately reflect intended properties.
  • Minor errors in the few incorrect oracles can be easily corrected using additional LLM-generated comment information.

Source: Generating executable oracles to check conformance of client code to requirements of JDK Javadocs using LLMs

DesignRepair: AI-Powered UI Design Quality Improvement

DesignRepair is a dual-stream system that examines and repairs UI design quality issues in LLM-generated interfaces, focusing on both code and rendered page aspects.

  • The system utilizes Google's Material Design principles as a knowledge base, encoding them into low-level component and high-level system design knowledge bases.
  • DesignRepair employs an LLM to extract key components and Playwright for precise page analysis, aligning these with the established knowledge bases.
  • Retrieval-Augmented Generation with advanced LLMs like GPT-4 is used to holistically refine and repair frontend code through a divide-and-conquer approach.
  • Evaluations demonstrate significant improvements in adherence to design guidelines, accessibility, and user experience metrics.

Source: DesignRepair: Dual-Stream Design Guideline-Aware Frontend Repair with Large Language Models

LLMs: Reshaping Software Development and Developer Roles

Large Language Models (LLMs) are transforming software engineering with their ability to generate human-like text, respond to complex queries, and write and interpret code.

  • LLMs offer unprecedented opportunities for innovation and collaboration in software development, extending beyond traditional AI applications.
  • These models are redefining the role of developers, with the potential to streamline workflows and enhance productivity.
  • Early adoption of LLMs in software engineering is crucial for staying competitive in the rapidly evolving tech landscape.
  • Challenges persist, necessitating a critical analysis of LLMs' technical strengths, limitations, and real-world applications.
  • The paper serves as a guide for developers, organizations, and researchers on harnessing LLMs' power and acquiring necessary skills.

Source: LLMs: A Game-Changer for Software Engineers?

UX-LLM: AI-Powered Usability Issue Prediction for iOS Apps

UX-LLM is a tool that uses a Large Vision-Language Model to predict usability issues in iOS apps, aiming to complement traditional usability testing methods.

  • The tool's performance was evaluated on two medium-complexity open-source apps, with precision ranging from 0.61 to 0.66 and recall between 0.35 and 0.38.
  • Results indicate UX-LLM can identify valid usability issues but misses the majority of problems found through traditional methods.
  • A focus group with an app development team found UX-LLM helpful in identifying unknown usability issues, though concerns about workflow integration were raised.
  • While not a replacement for traditional usability evaluation, UX-LLM serves as a valuable supplement, particularly for small teams with limited resources.
  • The tool's ability to inspect source code makes it useful for identifying issues in less common user paths.

Source: Does GenAI Make Usability Testing Obsolete?

LLM-based Application Development: Practitioner Insights

A study analyzing practitioner discussions on developing and deploying large language model (LLM) applications in production, providing key insights for software developers.

  • The research examined 189 videos from 2022 to 2024, focusing on practitioners actively developing LLM-based systems.
  • Analysis revealed 20 topics across 8 themes, with Design & Architecture emerging as the most prevalent theme. Retrieval-augmented generation (RAG) systems were a particular focus within this category.
  • Other frequently discussed topics included model capabilities, enhancement techniques (e.g., fine-tuning, prompt engineering), infrastructure and tooling, and risks and ethical challenges.
  • The study offers a systematic overview of key aspects practitioners should consider when developing LLM-based applications, highlighting current challenges and areas for further academic research.

Source: Practitioners' Discussions on Building LLM-based Applications for Production

Guidelines for Empirical Studies Involving LLMs in Software Engineering Research

A set of guidelines for conducting and assessing empirical studies that use LLMs in software engineering research, either as part of the research process or for evaluating LLM-based tools.

  • The rapid adoption of LLMs in software engineering research since ChatGPT's release has created a need for rigorous empirical evaluation standards.
  • Currently, no specific guidelines exist for studies involving LLMs in software engineering research, highlighting a gap in the field.
  • The paper aims to initiate a discussion within the software engineering research community to establish common standards for high-quality LLM-related studies.
  • These guidelines cover studies that use LLMs as part of the research process (e.g., for data annotation) and those evaluating existing or new LLM-based tools.

Source: Towards Evaluation Guidelines for Empirical Studies involving LLMs

Software Performance Engineering for Foundation Model-Powered Software

A paper highlighting the importance of Software Performance Engineering (SPE) in Foundation Model-powered software (FMware), identifying key challenges and discussing current practices.

  • FMware, software powered by Foundation Models like LLMs, requires complex engineering across various domains to become production-ready.
  • Performance engineering is critical for FMware to meet throughput and latency goals, avoiding user dissatisfaction and financial loss.
  • Four key SPE challenges for FMware: cognitive architecture design, communication protocols, tuning and optimization, and deployment.
  • Continuous performance engineering is essential to prevent degradation and ensure efficient hardware use, given FMware's high computational demands.
  • The paper discusses problems, current practices, and innovative paths for the software engineering community in addressing these challenges.

Source: Software Performance Engineering for Foundation Model-Powered Software (FMware)

Evaluation of AI Programming Assistants: ChatGPT, Gemini, AlphaCode, and GitHub Copilot

A comprehensive study evaluating the performance of leading AI programming assistants in natural language processing and code generation tasks across multiple programming languages.

  • The evaluation focuses on ChatGPT, Gemini (Bard AI), AlphaCode, and GitHub Copilot, assessing their accuracy in Java, Python, and C++ code generation.
  • Results highlight the strengths and weaknesses of each model, emphasizing the need for further improvements to enhance reliability and accuracy.
  • While these AI assistants demonstrate significant progress in language understanding and code generation, the study underscores the importance of ethical considerations and responsible usage.
  • The research provides valuable insights into the rapidly evolving field of AI models for programming, offering a comparative analysis of different LLMs.
  • Findings emphasize the need for continued refinement of AI technology and ethical development practices to fully realize the potential of these models in software development.

Source: Programming with AI: Evaluating ChatGPT, Gemini, AlphaCode, and GitHub Copilot for Programmers

Systems Engineering for LLM Integration in Socio-technical Systems

A paper exploring the application of systems engineering principles to address challenges in integrating LLMs into complex socio-technical systems for solving critical societal problems.

  • The complexity of socio-technical systems and the nature of LLMs present significant challenges for their integration.
  • Systems engineering approach is proposed as a more suitable method for LLM adoption, prioritizing problem context over technological aspects.
  • The paper surveys existing systems research efforts in engineering AI-based systems and discusses how systems engineering principles have addressed similar issues in the past.
  • Future directions for LLM adoption are provided based on the findings, emphasizing the importance of a holistic approach to integration.

Source: The Systems Engineering Approach in Times of Large Language Models

Guidelines for Empirical Studies with LLMs in Software Engineering

A set of guidelines for conducting and assessing empirical studies involving LLMs in software engineering research, addressing the lack of specific standards in this rapidly evolving field.

  • The guidelines focus on studies that either use LLMs as part of the research process or evaluate LLM-based tools.
  • This initiative aims to establish community standards for high-quality empirical studies involving LLMs in software engineering.
  • The paper acknowledges the significant impact of LLMs on the software engineering research landscape since ChatGPT's release in November 2022.
  • These guidelines serve as a starting point for discussion within the software engineering research community to develop a common understanding of best practices.

Source: Towards Evaluation Guidelines for Empirical Studies involving LLMs

LLMs for Teaching Energy-Efficient Software Development

A proposal to develop an undergraduate learning module for energy-efficient software engineering, leveraging LLMs to overcome curriculum constraints.

  • The increasing energy consumption of computing systems, particularly in data centers, highlights the need for energy-efficient software engineering techniques in undergraduate education.
  • LLMs demonstrate potential as domain experts, capable of generating energy-efficient variations of basic linear algebra codes for ARM64 and AMD64 architectures, along with unit tests and energy measurement harnesses.
  • Preliminary studies show energy expenditure reductions of 30-90% on toy examples suitable for classroom use.
  • The vision includes developing LLM-based meta-compilers as tools for students to transform high-level algorithms into efficient, hardware-specific implementations.
  • The learning module will incorporate systems thinking concepts, enabling students to reason about the local and global effects of energy optimizations.

Source: Can Large-Language Models Help us Better Understand and Teach the Development of Energy-Efficient Software?