
[AI Dev Tools] Turn Cursor into Devin, LLM-Generated Test Readability, Automated Code Review...

Source: https://arxiv.org/pdf/2412.18843v1

Devin.cursorrules: Enhancing Cursor/Windsurf IDE with Advanced AI Capabilities

Devin.cursorrules transforms Cursor or Windsurf IDE into a Devin-like experience, adding advanced agentic AI capabilities in minutes.

Key Features:
  • Enhances Cursor or Windsurf IDE with process planning, self-evolution, and extended tool usage capabilities.
  • Includes web scraping with JavaScript support using Playwright, search engine integration via DuckDuckGo, and LLM-powered text analysis.
  • Provides automated execution for Windsurf users in Docker containers.
  • Simple setup process using Python virtual environment and pip for dependency management.
  • Comprehensive unit tests for all included tools, ensuring reliability and functionality.
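
For a sense of what the bundled tooling looks like, here is a minimal, hypothetical sketch of a Playwright-based scraper in the spirit of the repository's web-scraping tool; the function name and defaults are illustrative assumptions, not the repo's actual interface.

```python
from playwright.sync_api import sync_playwright

def fetch_rendered_text(url: str, timeout_ms: int = 15000) -> str:
    """Render a page in headless Chromium and return its visible text."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, timeout=timeout_ms)
        text = page.inner_text("body")  # text after JavaScript has run
        browser.close()
    return text

if __name__ == "__main__":
    print(fetch_rendered_text("https://example.com")[:500])
```
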
Source: https://github.com/grapeot/devin.cursorrules

DeepCRCEval: Improved Evaluation for Code Review Comment Generation

A framework for evaluating automated code review comment generation, addressing limitations in traditional text similarity-based methods.

  • Existing benchmarks for code review comments are of low quality: fewer than 10% of their comments are suitable for automation.
  • DeepCRCEval incorporates human evaluators and LLMs to assess comment quality based on criteria from research and developer interviews.
  • The framework proves more reliable than text similarity metrics at distinguishing high-quality from low-quality comments.
  • Integration of LLM evaluators reduces evaluation time by 88.78% and cost by 90.32%.
  • LLM-Reviewer, a new baseline leveraging few-shot learning, shows promise in generating target-oriented comments.
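
As a rough illustration of the LLM-evaluator idea mentioned above (not the paper's actual rubric or prompts), a scoring call might look like the sketch below; the criteria list, prompt wording, model name, and use of the OpenAI client are all assumptions.

```python
import json
from openai import OpenAI  # assumed client; any chat-capable LLM API would do

client = OpenAI()  # reads OPENAI_API_KEY from the environment
CRITERIA = ["relevance", "clarity", "actionability"]  # illustrative subset, not the paper's full rubric

def score_comment(diff: str, comment: str) -> dict:
    """Ask an LLM to rate a review comment against a small set of criteria."""
    prompt = (
        "You evaluate code review comments. For the diff and comment below, "
        f"rate each of {CRITERIA} from 1 to 5 and answer with a JSON object.\n\n"
        f"Diff:\n{diff}\n\nComment:\n{comment}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)
```
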
Source: DeepCRCEval: Revisiting the Evaluation of Code Review Comment Generation

CodeRepoQA: Software Engineering Question-Answering Benchmark

CodeRepoQA is a large-scale benchmark for evaluating repository-level question-answering capabilities in software engineering. It covers five programming languages and various scenarios to comprehensively assess language models.

  • The dataset contains 585,687 multi-turn question-answering entries, with an average of 6.62 dialogue turns per entry.
  • Data was crawled from 30 well-known GitHub repositories and carefully filtered.
  • Evaluation of ten popular LLMs revealed limitations in their software engineering question-answering capabilities.
  • Medium-length contexts were found to be more conducive to LLM performance.
  • The benchmark is publicly available on GitHub for further research and development in the field.
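
The summary above implies a fairly simple record shape; a hypothetical Python model of one benchmark entry (field names assumed, not the benchmark's real schema) could look like this:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str      # e.g. "asker" (issue author) or "responder" (maintainer)
    content: str

@dataclass
class RepoQAEntry:
    repository: str            # one of the 30 crawled GitHub repositories
    language: str              # one of the five covered programming languages
    title: str
    turns: list[Turn] = field(default_factory=list)   # ~6.62 turns on average

    def dialogue_turns(self) -> int:
        return len(self.turns)
```
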
Source: CodeRepoQA: A Large-scale Benchmark for Software Engineering Question Answering

A2H Converter: Automated Android to HarmonyOS UI Migration

A2H Converter is an automated tool that migrates Android user interfaces to HarmonyOS, addressing the challenge of developing applications for multiple platforms.

  • The tool employs an LLM-driven multi-agent framework to convert Android XML layouts into HarmonyOS ArkUI layouts.
  • It uses RAG (Retrieval-Augmented Generation) combined with decision rules to map Android UI components to ArkUI equivalents.
  • A reflective mechanism continuously improves conversion accuracy.
  • A2H Converter handles project-level layouts, ensuring consistency across multiple files and addressing complex UI logic.
  • Experiments on six Android applications from GitHub demonstrated migration success rates of 90.1%, 89.3%, and 89.2% at the component, page, and project levels, respectively.
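
A toy sketch of the "decision rules first, retrieval as fallback" mapping described above; the component table, retrieval stub, and candidate list are invented for illustration and are not the tool's actual rules or RAG index.

```python
DIRECT_RULES = {
    "TextView": "Text",
    "Button": "Button",
    "ImageView": "Image",
    "LinearLayout": "Column",   # assumes vertical orientation; a real rule inspects attributes
}

def retrieve_candidates(android_widget: str) -> list[str]:
    """Placeholder for a RAG lookup over ArkUI documentation."""
    return ["Stack", "Row", "Column"]  # dummy candidates

def map_component(android_widget: str) -> str:
    if android_widget in DIRECT_RULES:
        return DIRECT_RULES[android_widget]
    # Fall back to retrieval plus LLM ranking (the LLM call is omitted in this sketch)
    return retrieve_candidates(android_widget)[0]

print(map_component("TextView"))  # -> "Text"
```
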
Source: A2H: A UI Converter from Android to HarmonyOS Platform

Improving Readability of Automatically Generated Tests with LLMs

A method to enhance the readability of search-based test generators' output using large language models (LLMs), while maintaining their high code coverage.

  • Search-based test generators produce effective unit tests with high coverage but lack meaningful names, making them difficult to understand.
  • The approach focuses on improving test and variable names without altering the tests' semantics or coverage.
  • Evaluation across nine industrial and open-source LLMs showed that the readability improvements are semantics-preserving and consistent across multiple iterations.
  • A human study with ten professional developers found that the LLM-improved tests were as readable as developer-written tests, regardless of the LLM used.
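
The key constraint is that renaming must not change behavior. A rough sketch of such a guardrail (the tooling choice and directory layout are assumptions) is to re-run the suite under coverage before and after applying the LLM's renamings and require identical results:

```python
import subprocess

def run_suite(test_dir: str) -> tuple[int, str]:
    """Run pytest under coverage and return (exit code, textual coverage report)."""
    run = subprocess.run(["coverage", "run", "-m", "pytest", test_dir])
    report = subprocess.run(["coverage", "report"], capture_output=True, text=True)
    return run.returncode, report.stdout

before_rc, before_cov = run_suite("tests_generated")   # search-based generator output
# ... apply LLM-proposed renamings to tests_generated/ here ...
after_rc, after_cov = run_suite("tests_generated")

assert (before_rc, before_cov) == (after_rc, after_cov), \
    "Renaming must not change test outcomes or coverage"
```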

Source: Improving the Readability of Automatically Generated Tests using Large Language Models

LLM-based Automated Code Review: Industrial Impact Study

A study examining the effects of LLM-based automated code review tools in an industrial setting, focusing on their impact on software quality, developer experience, and workflow efficiency.

  • The research analyzed 4,335 pull requests across three projects, 1,568 of which underwent automated reviews with an AI-assisted tool based on the open-source Qodo PR Agent.
  • Data collection involved quantitative analysis of pull request data, developer surveys on individual pull requests, and a broader survey of 22 practitioners' opinions on automated reviews.
  • Results showed 73.8% of automated comments were resolved, but average pull request closure duration increased from 5 hours 52 minutes to 8 hours 20 minutes.
  • Most practitioners reported minor improvements in code quality due to automated reviews, with benefits including enhanced bug detection, increased code quality awareness, and promotion of best practices.
  • Drawbacks of the LLM-based tool included longer pull request closure times, faulty reviews, unnecessary corrections, and irrelevant comments.

Source: Automated Code Review In Practice

Agentable: Detecting Defects in LLM-based AI Agents

A study and tool for defining and detecting defects in AI agents powered by large language models (LLMs), focusing on discrepancies between developer-implemented logic and LLM-generated content.

  • The research analyzed 6,854 StackOverflow posts to define 8 types of agent defects, providing detailed descriptions and examples for each.
  • Agentable, a static analysis tool, was developed to detect these defects using Code Property Graphs and LLMs to analyze agent workflows.
  • Two datasets were created for evaluation: AgentSet (84 real-world agents) and AgentTest (78 agents with designed defects). Agentable achieved 88.79% accuracy and 91.03% recall in defect detection.
  • Analysis of AgentSet revealed 889 defects, highlighting the prevalence of these issues in LLM-based AI agents.
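
As a toy illustration only: one could imagine a graph-based check over an agent workflow that flags LLM outputs reaching side-effecting steps without validation. The workflow, node names, and the single check below are invented for this sketch; Agentable itself operates on Code Property Graphs and covers all 8 defect types.

```python
workflow = {  # node -> list of downstream nodes (hypothetical agent workflow)
    "llm_generate_sql": ["execute_sql"],        # no validation step in between
    "llm_generate_reply": ["validate_reply"],
    "validate_reply": ["send_reply"],
}

LLM_NODES = {"llm_generate_sql", "llm_generate_reply"}
SIDE_EFFECT_NODES = {"execute_sql", "send_reply"}

def unvalidated_llm_outputs(graph: dict) -> list[tuple[str, str]]:
    """Flag edges where LLM output feeds a side-effecting node directly."""
    return [
        (src, dst)
        for src, dsts in graph.items() if src in LLM_NODES
        for dst in dsts if dst in SIDE_EFFECT_NODES
    ]

print(unvalidated_llm_outputs(workflow))  # -> [('llm_generate_sql', 'execute_sql')]
```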

Source: Defining and Detecting the Defects of the Large Language Model-based Autonomous Agents

SimilarGPT: Smart Contract Vulnerability Detection Using GPT and Code Similarity

SimilarGPT is a tool for identifying vulnerabilities in smart contracts by combining GPT models with code-based similarity checking methods.

  • The tool measures similarity between the inspected code and secure code from third-party libraries to detect potential vulnerabilities.
  • Topological ordering optimizes the detection sequence, enhancing logical coherence and reducing false positives.
  • A comprehensive reference codebase is established through analysis of code reuse patterns in smart contracts.
  • LLMs conduct in-depth analysis of similar codes to identify and explain potential vulnerabilities.
  • Experimental results show SimilarGPT excels in detecting vulnerabilities, particularly in reducing missed detections and false positives.
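
A toy sketch of the two ingredients combined here: a similarity score against reference implementations and a callee-before-caller inspection order. The reference snippet, contract sources, and call graph are hypothetical placeholders, not the paper's actual corpus.

```python
from difflib import SequenceMatcher
from graphlib import TopologicalSorter

# Hypothetical reference snippet, e.g. drawn from a widely reused library
REFERENCE = {
    "erc20_transfer": "function transfer(address to, uint256 amount) public returns (bool) { ... }",
}

# Hypothetical sources of the contract under inspection
CONTRACT = {
    "transfer": "function transfer(address to, uint256 amt) public returns (bool) { ... }",
    "withdraw": "function withdraw(uint256 amt) public { ... transfer(msg.sender, amt); }",
}

def similarity(candidate: str, reference: str) -> float:
    return SequenceMatcher(None, candidate, reference).ratio()

# The call graph maps each function to the functions it calls (its dependencies),
# so a topological order inspects callees before callers.
call_graph = {"withdraw": {"transfer"}, "transfer": set()}

for fn in TopologicalSorter(call_graph).static_order():
    score = max(similarity(CONTRACT[fn], ref) for ref in REFERENCE.values())
    print(f"{fn}: max similarity to reference code = {score:.2f}")
```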

Source: Combining GPT and Code-Based Similarity Checking for Effective Smart Contract Vulnerability Detection

Prompt Evolution in LLM-Integrated Software: An Empirical Study

A study analyzing 1,262 prompt changes across 243 GitHub repositories to understand how developers manage and evolve prompts in LLM-integrated applications.

  • Developers primarily evolve prompts through additions and modifications, with most changes occurring during feature development.
  • Only 21.9% of prompt changes are documented in commit messages, highlighting a significant gap in documentation practices.
  • Key challenges in prompt engineering include the introduction of logical inconsistencies and misalignment between prompt changes and LLM responses.
  • The findings emphasize the need for specialized testing frameworks, automated validation tools, and improved documentation practices to enhance the reliability of LLM-integrated applications.
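
A much cruder version of this kind of mining (the keyword heuristic is an assumption, far simpler than the study's analysis) could scan a repository's history for diff lines that touch prompt strings:

```python
import subprocess

def commits_touching_prompts(repo_path: str) -> list[str]:
    """Return commit hashes whose diffs add or remove lines mentioning 'prompt'."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "-p", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    hits, current = [], None
    for line in log.splitlines():
        if line.startswith("commit "):
            current = line.split()[1]
        elif line.startswith(("+", "-")) and "prompt" in line.lower():
            if current and current not in hits:
                hits.append(current)
    return hits
```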

Source: Prompting in the Wild: An Empirical Study of Prompt Evolution in Software Repositories

LLM Performance Analysis for Code Summarization

A comparative study of open-source LLMs (LLaMA-3, Phi-3, Mistral, and Gemma) evaluates their performance in generating concise natural language descriptions for source code.

  • Code summarization aims to create brief, accurate descriptions of source code, an increasingly important task in software engineering.
  • The analysis focuses on assessing the performance of four open-source LLMs: LLaMA-3, Phi-3, Mistral, and Gemma.
  • Performance evaluation uses key metrics including BLEU and ROUGE scores to measure the quality and accuracy of generated summaries.
  • The study's findings are expected to provide insights into each model's strengths and weaknesses, contributing to the ongoing development of LLMs for code summarization tasks.
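
For reference, computing these metrics for a single generated summary is straightforward; the library choices below (NLTK for BLEU, rouge-score for ROUGE) are our own, as the paper does not prescribe particular implementations.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "returns the user record matching the given id"
candidate = "fetches the user record for an id"

bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,  # smoothing for short sentences
)
rouge = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)

print(f"BLEU: {bleu:.3f}, ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```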

Source: Analysis on LLMs Performance for Code Summarization

Trust Calibration for AI Refactoring in IDEs

A position paper advocating for the integration of LLM-based refactoring in IDEs with a focus on trust development and safeguards.

  • LLMs offer a new approach to large-scale code improvement through AI-assisted refactoring, addressing the industry's tendency to prioritize new features over code maintenance.
  • The paper highlights inherent risks of LLM use, such as breaking changes and potential security vulnerabilities, emphasizing the need for trustworthy safeguards and IDE encapsulation.
  • Future work will be based on established models from human factors in automation research, focusing on developing novel LLM safeguards and user interactions that foster appropriate trust levels.
  • The research involves collaboration with CodeScene, enabling large-scale repository analysis and A/B testing to guide the design of research interventions.

Source: Trust Calibration in IDEs: Paving the Way for Widespread Adoption of AI Refactoring

ARCAS: Automated Root Cause Analysis for Complex Data Products

ARCAS is a diagnostic platform using a Domain Specific Language (DSL) for rapid implementation of automated troubleshooting guides in complex data products.

  • The system comprises a network of automated troubleshooting guides (Auto-TSGs) that run concurrently, detecting issues and applying near-real-time mitigation using product telemetry.
  • ARCAS's DSL allows subject matter experts to create relevant Auto-TSGs quickly, without needing to understand the entire diagnostic platform, reducing time-to-mitigate and conserving engineering resources.
  • Unlike monitoring-focused platforms such as Datadog and New Relic, ARCAS employs an LLM to prioritize Auto-TSG outputs and initiate appropriate actions, eliminating the need for manual intervention or comprehensive system knowledge.
  • The platform has been successfully implemented across multiple products in Azure Synapse Analytics and Microsoft Fabric Synapse Data Warehouse.
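
ARCAS's DSL is internal and not shown in this summary, so the following is a conceptual sketch only: declarative troubleshooting rules evaluated against telemetry, with a static severity sort standing in for the LLM-based prioritization. Every name and threshold is invented.

```python
AUTO_TSGS = [
    {
        "name": "tempdb_pressure",
        "detect": lambda t: t.get("tempdb_used_pct", 0) > 90,
        "mitigation": "scale out tempdb files",
        "severity": 2,
    },
    {
        "name": "query_queue_backlog",
        "detect": lambda t: t.get("queued_queries", 0) > 100,
        "mitigation": "pause non-critical workloads",
        "severity": 1,  # lower number = more urgent
    },
]

def run_auto_tsgs(telemetry: dict) -> list[dict]:
    """Evaluate every Auto-TSG and return triggered findings, most urgent first."""
    findings = [tsg for tsg in AUTO_TSGS if tsg["detect"](telemetry)]
    # In ARCAS an LLM prioritizes findings; a severity sort stands in here.
    return sorted(findings, key=lambda f: f["severity"])

print(run_auto_tsgs({"tempdb_used_pct": 95, "queued_queries": 12}))
```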

Source: Automated Root Cause Analysis System for Complex Data Products

Challenges of Software Engineering with Large Language Models

A comprehensive analysis of the challenges facing software engineering as it integrates large language models (LLMs), based on discussions with over 20 experts from academia and industry.

  • The integration of LLMs in software engineering (LLM4SE) is transforming the software development life cycle, offering potential for reduced human effort across various development activities.
  • The study identifies 26 key challenges across seven aspects of software development, including requirements and design, coding assistance, testing, code review, maintenance, vulnerability management, and data handling.
  • These challenges stem from the paradigm shift brought about by LLMs' unprecedented capacity to understand, generate, and operate programming languages.
  • The research aims to benefit future LLM4SE studies by highlighting areas that require attention and innovation in this rapidly evolving field.

Source: The Current Challenges of Software Engineering in the Era of Large Language Models

LLM-based Test Generators: Challenges in Bug Detection

A critical examination of LLM-based test generation tools reveals potential limitations in their ability to detect bugs effectively.

  • Recent tools like Codium CoverAgent and CoverUp, designed for automated test case generation, may unintentionally validate faulty code.
  • The study focuses on a crucial aspect: these tools' test oracles are designed to pass, potentially conflicting with the primary goal of exposing bugs through failing test cases.
  • Evaluation using real human-written buggy code demonstrates that LLM-generated tests can miss bugs and, more concerningly, validate them within the generated test suite.
  • The research raises questions about the design principles behind LLM-based test generation tools and their impact on software quality assurance.
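
A tiny, invented example of the concern: if the oracle is derived from the code's observed behavior, a passing test can lock in a bug rather than expose it.

```python
def apply_discount(price: float, percent: float) -> float:
    return price - percent          # bug: should be price * (1 - percent / 100)

def test_apply_discount():
    # An oracle derived from running the buggy code asserts the buggy value:
    assert apply_discount(50.0, 10.0) == 40.0   # passes; the correct answer is 45.0
```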

Source: Design choices made by LLM-based test generators prevent them from finding bugs

Generative AI Toolkit: Automating LLM-Based Application Workflows

A framework for improving the quality and efficiency of LLM-based applications throughout their lifecycle, addressing the challenges of manual, slow, and trial-and-error-based development processes.

  • The toolkit automates essential workflows for configuring, testing, monitoring, and optimizing Generative AI applications, including agents.
  • Key benefits include significant quality improvements and shorter release cycles for LLM-based applications.
  • Effectiveness of the toolkit is demonstrated through representative use cases, with best practices shared for implementation.
  • The Generative AI Toolkit is open-sourced, encouraging adoption, adaptation, and improvement by other development teams.

Source: Generative AI Toolkit -- a framework for increasing the quality of LLM-based applications over their whole life cycle

Visual Code Assistants: Integrating Sketches into IDEs for ML Workflows

A study exploring the integration of Visual Code Assistants in IDEs, focusing on converting sketches into code for machine learning workflows.

  • The research involved 19 data scientists, examining their sketching patterns when developing ML workflows. Diagrams were the preferred organizational component (52.6%), followed by lists (42.1%) and numbered points (36.8%).
  • A prototype Visual Code Assistant was developed to convert sketches into Python notebooks using an LLM. The quality of generated code was evaluated using an LLM-as-judge setup.
  • Results showed that even brief sketching could effectively generate useful code outlines, with a positive correlation between sketch time and code quality.
  • Interviews with participants revealed promising applications for Visual Code Assistants in education, prototyping, and collaborative settings.
  • The study suggests potential for the next generation of Code Assistants to integrate visual information, improving code generation and leveraging developers' existing sketching practices.

Source: An Exploratory Study of ML Sketches and Visual Code Assistants

LLMs in Mission-Critical IT Governance: A Survey of Practitioner Perspectives

A survey exploring the potential use of Large Language Models (LLMs) in governing mission-critical IT systems, focusing on security practitioners' views and concerns.

  • The study aims to provide insights for researchers, practitioners, and policymakers on using generative AI in mission-critical systems (MCSs) governance.
  • Survey data collected from developers and security personnel will help identify trends, challenges, and opportunities for introducing LLMs in MCS governance.
  • Findings emphasize the need for interdisciplinary collaboration to ensure safe use of LLMs in MCS governance.
  • Researchers should focus on developing regulation-oriented models and addressing accountability issues.
  • Practitioners prioritize data protection and transparency, while policymakers are urged to establish a unified AI framework with global benchmarks for ethical and secure LLM-based MCS governance.

Source: On Large Language Models in Mission-Critical IT Governance: Are We Ready Yet?