
[AI Dev Tools] Turn Cursor into Devin, LLM-Generated Test Readability, Automated Code Review...

Source: https://arxiv.org/pdf/2412.18843v1

Devin.cursorrules: Enhancing Cursor/Windsurf IDE with Advanced AI Capabilities

Devin.cursorrules transforms Cursor or Windsurf IDE into a Devin-like experience, adding advanced agentic AI capabilities in minutes.

Key Features:
  • Enhances Cursor or Windsurf IDE with process planning, self-evolution, and extended tool usage capabilities.
  • Includes web scraping with JavaScript support using Playwright, search engine integration via DuckDuckGo, and LLM-powered text analysis.
  • Provides automated execution for Windsurf users in Docker containers.
  • Simple setup process using Python virtual environment and pip for dependency management.
  • Comprehensive unit tests for all included tools, ensuring reliability and functionality.
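
For a sense of what the bundled tooling looks like, here is a minimal, hypothetical sketch of a Playwright-based scraper in the spirit of the repository's web-scraping tool; the function name and defaults are illustrative assumptions, not the repo's actual interface.

```python
from playwright.sync_api import sync_playwright

def fetch_rendered_text(url: str, timeout_ms: int = 15000) -> str:
    """Render a page in headless Chromium and return its visible text."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, timeout=timeout_ms)
        text = page.inner_text("body")  # text after JavaScript has run
        browser.close()
    return text

if __name__ == "__main__":
    print(fetch_rendered_text("https://example.com")[:500])
```
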
Source: https://github.com/grapeot/devin.cursorrules

DeepCRCEval: Improved Evaluation for Code Review Comment Generation

A framework for evaluating automated code review comment generation, addressing limitations in traditional text similarity-based methods.

  • Existing benchmarks for code review comments are of low quality: fewer than 10% of their comments are suitable for automation.
  • DeepCRCEval incorporates human evaluators and LLMs to assess comment quality based on criteria from research and developer interviews.
  • The framework proves more reliable than text similarity metrics at distinguishing high-quality from low-quality comments.
  • Integration of LLM evaluators reduces evaluation time by 88.78% and cost by 90.32%.
  • LLM-Reviewer, a new baseline leveraging few-shot learning, shows promise in generating target-oriented comments.
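
As a rough illustration of the LLM-evaluator idea mentioned above (not the paper's actual rubric or prompts), a scoring call might look like the sketch below; the criteria list, prompt wording, model name, and use of the OpenAI client are all assumptions.

```python
import json
from openai import OpenAI  # assumed client; any chat-capable LLM API would do

client = OpenAI()  # reads OPENAI_API_KEY from the environment
CRITERIA = ["relevance", "clarity", "actionability"]  # illustrative subset, not the paper's full rubric

def score_comment(diff: str, comment: str) -> dict:
    """Ask an LLM to rate a review comment against a small set of criteria."""
    prompt = (
        "You evaluate code review comments. For the diff and comment below, "
        f"rate each of {CRITERIA} from 1 to 5 and answer with a JSON object.\n\n"
        f"Diff:\n{diff}\n\nComment:\n{comment}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)
```
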
Source: DeepCRCEval: Revisiting the Evaluation of Code Review Comment Generation

CodeRepoQA: Software Engineering Question-Answering Benchmark

CodeRepoQA is a large-scale benchmark for evaluating repository-level question-answering capabilities in software engineering. It covers five programming languages and various scenarios to comprehensively assess language models.

  • The dataset contains 585,687 multi-turn question-answering entries, with an average of 6.62 dialogue turns per entry.
  • Data was crawled from 30 well-known GitHub repositories and carefully filtered.
  • Evaluation of ten popular LLMs revealed limitations in their software engineering question-answering capabilities.
  • Medium-length contexts were found to be more conducive to LLM performance.
  • The benchmark is publicly available on GitHub for further research and development in the field.
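
The summary above implies a fairly simple record shape; a hypothetical Python model of one benchmark entry (field names assumed, not the benchmark's real schema) could look like this:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str      # e.g. "asker" (issue author) or "responder" (maintainer)
    content: str

@dataclass
class RepoQAEntry:
    repository: str            # one of the 30 crawled GitHub repositories
    language: str              # one of the five covered programming languages
    title: str
    turns: list[Turn] = field(default_factory=list)   # ~6.62 turns on average

    def dialogue_turns(self) -> int:
        return len(self.turns)
```
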
Source: CodeRepoQA: A Large-scale Benchmark for Software Engineering Question Answering

A2H Converter: Automated Android to HarmonyOS UI Migration

A2H Converter is an automated tool that migrates Android user interfaces to HarmonyOS, addressing the challenge of developing applications for multiple platforms.

  • The tool employs an LLM-driven multi-agent framework to convert Android XML layouts into HarmonyOS ArkUI layouts.
  • It uses RAG (Retrieval-Augmented Generation) combined with decision rules to map Android UI components to ArkUI equivalents.
  • A reflective mechanism continuously improves conversion accuracy.
  • A2H Converter handles project-level layouts, ensuring consistency across multiple files and addressing complex UI logic.
  • Experiments on six Android applications from GitHub demonstrated migration success rates of 90.1%, 89.3%, and 89.2% at the component, page, and project levels, respectively.
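
A toy sketch of the "decision rules first, retrieval as fallback" mapping described above; the component table, retrieval stub, and candidate list are invented for illustration and are not the tool's actual rules or RAG index.

```python
DIRECT_RULES = {
    "TextView": "Text",
    "Button": "Button",
    "ImageView": "Image",
    "LinearLayout": "Column",   # assumes vertical orientation; a real rule inspects attributes
}

def retrieve_candidates(android_widget: str) -> list[str]:
    """Placeholder for a RAG lookup over ArkUI documentation."""
    return ["Stack", "Row", "Column"]  # dummy candidates

def map_component(android_widget: str) -> str:
    if android_widget in DIRECT_RULES:
        return DIRECT_RULES[android_widget]
    # Fall back to retrieval plus LLM ranking (the LLM call is omitted in this sketch)
    return retrieve_candidates(android_widget)[0]

print(map_component("TextView"))  # -> "Text"
```
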
Source: A2H: A UI Converter from Android to HarmonyOS Platform

Improving Readability of Automatically Generated Tests with LLMs

A method to enhance the readability of search-based test generators' output using large language models (LLMs), while maintaining their high code coverage.

  • Search-based test generators produce effective unit tests with high coverage but lack meaningful names, making them difficult to understand.
  • The approach focuses on improving test and variable names without altering the tests' semantics or coverage.
  • Evaluation across nine industrial and open-source LLMs showed that the readability improvements are semantics-preserving and consistent across multiple iterations.
  • A human study with ten professional developers found that the LLM-improved tests were as readable as developer-written tests, regardless of the LLM used.
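
The key constraint is that renaming must not change behavior. A rough sketch of such a guardrail (the tooling choice and directory layout are assumptions) is to re-run the suite under coverage before and after applying the LLM's renamings and require identical results:

```python
import subprocess

def run_suite(test_dir: str) -> tuple[int, str]:
    """Run pytest under coverage and return (exit code, textual coverage report)."""
    run = subprocess.run(["coverage", "run", "-m", "pytest", test_dir])
    report = subprocess.run(["coverage", "report"], capture_output=True, text=True)
    return run.returncode, report.stdout

before_rc, before_cov = run_suite("tests_generated")   # search-based generator output
# ... apply LLM-proposed renamings to tests_generated/ here ...
after_rc, after_cov = run_suite("tests_generated")

assert (before_rc, before_cov) == (after_rc, after_cov), \
    "Renaming must not change test outcomes or coverage"
```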

Source: Improving the Readability of Automatically Generated Tests using Large Language Models

LLM-based Automated Code Review: Industrial Impact Study

A study examining the effects of LLM-based automated code review tools in an industrial setting, focusing on their impact on software quality, developer experience, and workflow efficiency.

  • The research analyzed 4,335 pull requests across three projects, 1,568 of which underwent automated reviews with an AI-assisted tool based on the open-source Qodo PR Agent.
  • Data collection involved quantitative analysis of pull request data, developer surveys on individual pull requests, and a broader survey of 22 practitioners' opinions on automated reviews.
  • Results showed 73.8% of automated comments were resolved, but average pull request closure duration increased from 5 hours 52 minutes to 8 hours 20 minutes.
  • Most practitioners reported minor improvements in code quality due to automated reviews, with benefits including enhanced bug detection, increased code quality awareness, and promotion of best practices.
  • Drawbacks of the LLM-based tool included longer pull request closure times, faulty reviews, unnecessary corrections, and irrelevant comments.

Source: Automated Code Review In Practice

Agentable: Detecting Defects in LLM-based AI Agents

A study and tool for defining and detecting defects in AI agents powered by large language models (LLMs), focusing on discrepancies between developer-implemented logic and LLM-generated content.

  • The research analyzed 6,854 StackOverflow posts to define 8 types of agent defects, providing detailed descriptions and examples for each.
  • Agentable, a static analysis tool, was developed to detect these defects using Code Property Graphs and LLMs to analyze agent workflows.
  • Two datasets were created for evaluation: AgentSet (84 real-world agents) and AgentTest (78 agents with designed defects). Agentable achieved 88.79% accuracy and 91.03% recall in defect detection.
  • Analysis of AgentSet revealed 889 defects, highlighting the prevalence of these issues in LLM-based AI agents.
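
As a toy illustration only: one could imagine a graph-based check over an agent workflow that flags LLM outputs reaching side-effecting steps without validation. The workflow, node names, and the single check below are invented for this sketch; Agentable itself operates on Code Property Graphs and covers all 8 defect types.

```python
workflow = {  # node -> list of downstream nodes (hypothetical agent workflow)
    "llm_generate_sql": ["execute_sql"],        # no validation step in between
    "llm_generate_reply": ["validate_reply"],
    "validate_reply": ["send_reply"],
}

LLM_NODES = {"llm_generate_sql", "llm_generate_reply"}
SIDE_EFFECT_NODES = {"execute_sql", "send_reply"}

def unvalidated_llm_outputs(graph: dict) -> list[tuple[str, str]]:
    """Flag edges where LLM output feeds a side-effecting node directly."""
    return [
        (src, dst)
        for src, dsts in graph.items() if src in LLM_NODES
        for dst in dsts if dst in SIDE_EFFECT_NODES
    ]

print(unvalidated_llm_outputs(workflow))  # -> [('llm_generate_sql', 'execute_sql')]
```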

Source: Defining and Detecting the Defects of the Large Language Model-based Autonomous Agents

SimilarGPT: Smart Contract Vulnerability Detection Using GPT and Code Similarity

SimilarGPT is a tool for identifying vulnerabilities in smart contracts by combining GPT models with code-based similarity checking methods.

  • The tool measures similarity between the inspected code and secure code from third-party libraries to detect potential vulnerabilities.
  • Topological ordering optimizes the detection sequence, enhancing logical coherence and reducing false positives.
  • A comprehensive reference codebase is established through analysis of code reuse patterns in smart contracts.
  • LLMs conduct in-depth analysis of similar codes to identify and explain potential vulnerabilities.
  • Experimental results show SimilarGPT excels in detecting vulnerabilities, particularly in reducing missed detections and false positives.
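
A toy sketch of the two ingredients combined here: a similarity score against reference implementations and a callee-before-caller inspection order. The reference snippet, contract sources, and call graph are hypothetical placeholders, not the paper's actual corpus.

```python
from difflib import SequenceMatcher
from graphlib import TopologicalSorter

# Hypothetical reference snippet, e.g. drawn from a widely reused library
REFERENCE = {
    "erc20_transfer": "function transfer(address to, uint256 amount) public returns (bool) { ... }",
}

# Hypothetical sources of the contract under inspection
CONTRACT = {
    "transfer": "function transfer(address to, uint256 amt) public returns (bool) { ... }",
    "withdraw": "function withdraw(uint256 amt) public { ... transfer(msg.sender, amt); }",
}

def similarity(candidate: str, reference: str) -> float:
    return SequenceMatcher(None, candidate, reference).ratio()

# The call graph maps each function to the functions it calls (its dependencies),
# so a topological order inspects callees before callers.
call_graph = {"withdraw": {"transfer"}, "transfer": set()}

for fn in TopologicalSorter(call_graph).static_order():
    score = max(similarity(CONTRACT[fn], ref) for ref in REFERENCE.values())
    print(f"{fn}: max similarity to reference code = {score:.2f}")
```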

Source: Combining GPT and Code-Based Similarity Checking for Effective Smart Contract Vulnerability Detection

Prompt Evolution in LLM-Integrated Software: An Empirical Study

A study analyzing 1,262 prompt changes across 243 GitHub repositories to understand how developers manage and evolve prompts in LLM-integrated applications.

  • Developers primarily evolve prompts through additions and modifications, with most changes occurring during feature development.
  • Only 21.9% of prompt changes are documented in commit messages, highlighting a significant gap in documentation practices.
  • Key challenges in prompt engineering include the introduction of logical inconsistencies and misalignment between prompt changes and LLM responses.
  • The findings emphasize the need for specialized testing frameworks, automated validation tools, and improved documentation practices to enhance the reliability of LLM-integrated applications.
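
A much cruder version of this kind of mining (the keyword heuristic is an assumption, far simpler than the study's analysis) could scan a repository's history for diff lines that touch prompt strings:

```python
import subprocess

def commits_touching_prompts(repo_path: str) -> list[str]:
    """Return commit hashes whose diffs add or remove lines mentioning 'prompt'."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "-p", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    hits, current = [], None
    for line in log.splitlines():
        if line.startswith("commit "):
            current = line.split()[1]
        elif line.startswith(("+", "-")) and "prompt" in line.lower():
            if current and current not in hits:
                hits.append(current)
    return hits
```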

Source: Prompting in the Wild: An Empirical Study of Prompt Evolution in Software Repositories

LLM Performance Analysis for Code Summarization

A comparative study of open-source LLMs (LLaMA-3, Phi-3, Mistral, and Gemma) evaluates their performance in generating concise natural language descriptions for source code.

  • Code summarization aims to create brief, accurate descriptions of source code, an increasingly important task in software engineering.
  • The analysis focuses on assessing the performance of four open-source LLMs: LLaMA-3, Phi-3, Mistral, and Gemma.
  • Performance evaluation uses key metrics including BLEU and ROUGE scores to measure the quality and accuracy of generated summaries.
  • The study's findings are expected to provide insights into each model's strengths and weaknesses, contributing to the ongoing development of LLMs for code summarization tasks.
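
For reference, computing these metrics for a single generated summary is straightforward; the library choices below (NLTK for BLEU, rouge-score for ROUGE) are our own, as the paper does not prescribe particular implementations.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "returns the user record matching the given id"
candidate = "fetches the user record for an id"

bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,  # smoothing for short sentences
)
rouge = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)

print(f"BLEU: {bleu:.3f}, ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```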

Source: Analysis on LLMs Performance for Code Summarization

Trust Calibration for AI Refactoring in IDEs

A position paper advocating for the integration of LLM-based refactoring in IDEs with a focus on trust development and safeguards.

  • LLMs offer a new approach to large-scale code improvement through AI-assisted refactoring, addressing the industry's tendency to prioritize new features over code maintenance.
  • The paper highlights inherent risks of LLM use, such as breaking changes and potential security vulnerabilities, emphasizing the need for trustworthy safeguards and IDE encapsulation.
  • Future work will be based on established models from human factors in automation research, focusing on developing novel LLM safeguards and user interactions that foster appropriate trust levels.
  • The research involves collaboration with CodeScene, enabling large-scale repository analysis and A/B testing to guide the design of research interventions.

Source: Trust Calibration in IDEs: Paving the Way for Widespread Adoption of AI Refactoring

ARCAS: Automated Root Cause Analysis for Complex Data Products

ARCAS is a diagnostic platform using a Domain Specific Language (DSL) for rapid implementation of automated troubleshooting guides in complex data products.

  • The system comprises a network of automated troubleshooting guides (Auto-TSGs) that run concurrently, detecting issues and applying near-real-time mitigation using product telemetry.
  • ARCAS's DSL allows subject matter experts to create relevant Auto-TSGs quickly, without needing to understand the entire diagnostic platform, reducing time-to-mitigate and conserving engineering resources.
  • Unlike monitoring-focused platforms such as Datadog and New Relic, ARCAS employs an LLM to prioritize Auto-TSG outputs and initiate appropriate actions, eliminating the need for manual intervention or comprehensive system knowledge.
  • The platform has been successfully implemented across multiple products in Azure Synapse Analytics and Microsoft Fabric Synapse Data Warehouse.
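
ARCAS's DSL is internal and not shown in this summary, so the following is a conceptual sketch only: declarative troubleshooting rules evaluated against telemetry, with a static severity sort standing in for the LLM-based prioritization. Every name and threshold is invented.

```python
AUTO_TSGS = [
    {
        "name": "tempdb_pressure",
        "detect": lambda t: t.get("tempdb_used_pct", 0) > 90,
        "mitigation": "scale out tempdb files",
        "severity": 2,
    },
    {
        "name": "query_queue_backlog",
        "detect": lambda t: t.get("queued_queries", 0) > 100,
        "mitigation": "pause non-critical workloads",
        "severity": 1,  # lower number = more urgent
    },
]

def run_auto_tsgs(telemetry: dict) -> list[dict]:
    """Evaluate every Auto-TSG and return triggered findings, most urgent first."""
    findings = [tsg for tsg in AUTO_TSGS if tsg["detect"](telemetry)]
    # In ARCAS an LLM prioritizes findings; a severity sort stands in here.
    return sorted(findings, key=lambda f: f["severity"])

print(run_auto_tsgs({"tempdb_used_pct": 95, "queued_queries": 12}))
```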

Source: Automated Root Cause Analysis System for Complex Data Products

Challenges of Software Engineering with Large Language Models

A comprehensive analysis of the challenges facing software engineering as it integrates large language models (LLMs), based on discussions with over 20 experts from academia and industry.

  • The integration of LLMs in software engineering (LLM4SE) is transforming the software development life cycle, offering potential for reduced human effort across various development activities.
  • The study identifies 26 key challenges across seven aspects of software development, including requirements and design, coding assistance, testing, code review, maintenance, vulnerability management, and data handling.
  • These challenges stem from the paradigm shift brought about by LLMs' unprecedented capacity to understand, generate, and operate programming languages.
  • The research aims to benefit future LLM4SE studies by highlighting areas that require attention and innovation in this rapidly evolving field.

Source: The Current Challenges of Software Engineering in the Era of Large Language Models

LLM-based Test Generators: Challenges in Bug Detection

A critical examination of LLM-based test generation tools reveals potential limitations in their ability to detect bugs effectively.

  • Recent tools like Codium CoverAgent and CoverUp, designed for automated test case generation, may unintentionally validate faulty code.
  • The study focuses on a crucial aspect: these tools' test oracles are designed to pass, potentially conflicting with the primary goal of exposing bugs through failing test cases.
  • Evaluation using real human-written buggy code demonstrates that LLM-generated tests can miss bugs and, more concerningly, validate them within the generated test suite.
  • The research raises questions about the design principles behind LLM-based test generation tools and their impact on software quality assurance.
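
A tiny, invented example of the concern: if the oracle is derived from the code's observed behavior, a passing test can lock in a bug rather than expose it.

```python
def apply_discount(price: float, percent: float) -> float:
    return price - percent          # bug: should be price * (1 - percent / 100)

def test_apply_discount():
    # An oracle derived from running the buggy code asserts the buggy value:
    assert apply_discount(50.0, 10.0) == 40.0   # passes; the correct answer is 45.0
```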

Source: Design choices made by LLM-based test generators prevent them from finding bugs

Generative AI Toolkit: Automating LLM-Based Application Workflows

A framework for improving the quality and efficiency of LLM-based applications throughout their lifecycle, addressing the challenges of manual, slow, and trial-and-error-based development processes.

  • The toolkit automates essential workflows for configuring, testing, monitoring, and optimizing Generative AI applications, including agents.
  • Key benefits include significant quality improvements and shorter release cycles for LLM-based applications.
  • Effectiveness of the toolkit is demonstrated through representative use cases, with best practices shared for implementation.
  • The Generative AI Toolkit is open-sourced, encouraging adoption, adaptation, and improvement by other development teams.

Source: Generative AI Toolkit -- a framework for increasing the quality of LLM-based applications over their whole life cycle

Visual Code Assistants: Integrating Sketches into IDEs for ML Workflows

A study exploring the integration of Visual Code Assistants in IDEs, focusing on converting sketches into code for machine learning workflows.

  • The research involved 19 data scientists, examining their sketching patterns when developing ML workflows. Diagrams were the preferred organizational component (52.6%), followed by lists (42.1%) and numbered points (36.8%).
  • A prototype Visual Code Assistant was developed to convert sketches into Python notebooks using an LLM. The quality of generated code was evaluated using an LLM-as-judge setup.
  • Results showed that even brief sketching could effectively generate useful code outlines, with a positive correlation between sketch time and code quality.
  • Interviews with participants revealed promising applications for Visual Code Assistants in education, prototyping, and collaborative settings.
  • The study suggests potential for the next generation of Code Assistants to integrate visual information, improving code generation and leveraging developers' existing sketching practices.

Source: An Exploratory Study of ML Sketches and Visual Code Assistants

LLMs in Mission-Critical IT Governance: A Survey of Practitioner Perspectives

A survey exploring the potential use of Large Language Models (LLMs) in governing mission-critical IT systems, focusing on security practitioners' views and concerns.

  • The study aims to provide insights for researchers, practitioners, and policymakers on using generative AI in mission-critical systems (MCSs) governance.
  • Survey data collected from developers and security personnel will help identify trends, challenges, and opportunities for introducing LLMs in MCS governance.
  • Findings emphasize the need for interdisciplinary collaboration to ensure safe use of LLMs in MCS governance.
  • Researchers should focus on developing regulation-oriented models and addressing accountability issues.
  • Practitioners prioritize data protection and transparency, while policymakers are urged to establish a unified AI framework with global benchmarks for ethical and secure LLM-based MCS governance.

Source: On Large Language Models in Mission-Critical IT Governance: Are We Ready Yet?