[AI Dev Tools] Turn Cursor to Devin, LLM Generated Tests Readability, Automated Code Review...
![[AI Dev Tools] Turn Cursor to Devin, LLM Generated Tests Readability, Automated Code Review...](/content/images/size/w960/2024/12/Screenshot_1.jpg)
Devin.cursorrules: Enhancing Cursor/Windsurf IDE with Advanced AI Capabilities
Devin.cursorrules transforms Cursor or Windsurf IDE into a Devin-like experience, adding advanced agentic AI capabilities in minutes.
Key Features:
- Enhances Cursor or Windsurf IDE with process planning, self-evolution, and extended tool usage capabilities.
- Includes web scraping with JavaScript support using Playwright, search engine integration via DuckDuckGo, and LLM-powered text analysis (see the sketch below).
- Provides automated execution for Windsurf users in Docker containers.
- Simple setup process using Python virtual environment and pip for dependency management.
- Comprehensive unit tests for all included tools, ensuring reliability and functionality.
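To make the "extended tool usage" concrete, here is a minimal sketch of a search-then-scrape helper in the spirit of the bundled tools. It is not the repository's actual code; it only assumes the `duckduckgo_search` and `playwright` packages are installed.

```python
# Minimal sketch of a search-then-scrape tool, not devin.cursorrules' actual code.
# Assumes: pip install duckduckgo_search playwright && playwright install chromium
from duckduckgo_search import DDGS
from playwright.sync_api import sync_playwright


def search(query: str, max_results: int = 5) -> list[dict]:
    """Return DuckDuckGo results as dicts with 'title', 'href', and 'body' keys."""
    return list(DDGS().text(query, max_results=max_results))


def scrape(url: str) -> str:
    """Render a page with JavaScript support and return its visible text."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        text = page.inner_text("body")
        browser.close()
    return text


if __name__ == "__main__":
    for hit in search("Cursor IDE agentic rules"):
        print(hit["title"], hit["href"])
```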
DeepCRCEval: Improved Evaluation for Code Review Comment Generation
A framework for evaluating automated code review comment generation, addressing limitations in traditional text similarity-based methods.
- Comments in current code review benchmarks are largely of low quality, with fewer than 10% suitable as targets for automation.
- DeepCRCEval incorporates human evaluators and LLMs to assess comment quality based on criteria from research and developer interviews.
- The framework proves more reliable than text similarity metrics in distinguishing high and low-quality comments.
- Integration of LLM evaluators reduces evaluation time by 88.78% and cost by 90.32%.
- LLM-Reviewer, a new baseline leveraging few-shot learning, shows promise in generating target-oriented comments.
Source: DeepCRCEval: Revisiting the Evaluation of Code Review Comment Generation
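The LLM-evaluator idea boils down to rubric scoring with a prompt. The sketch below is only illustrative: the criteria names, scale, and model are placeholders rather than DeepCRCEval's actual rubric, and it assumes the official `openai` Python client with an `OPENAI_API_KEY` in the environment.

```python
# Illustrative rubric-based scoring of a generated review comment with an LLM.
# Criteria, scale, and model name are placeholders, not DeepCRCEval's setup.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CRITERIA = ["relevance", "clarity", "actionability"]  # hypothetical subset


def score_comment(diff: str, comment: str, model: str = "gpt-4o-mini") -> dict:
    prompt = (
        "Rate the following code review comment on each criterion from 1 to 5.\n"
        f"Criteria: {', '.join(CRITERIA)}\n\n"
        f"Code change:\n{diff}\n\nReview comment:\n{comment}\n\n"
        'Answer as JSON, e.g. {"relevance": 3, "clarity": 4, "actionability": 2}.'
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```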
CodeRepoQA: Software Engineering Question-Answering Benchmark
CodeRepoQA is a large-scale benchmark for evaluating repository-level question-answering capabilities in software engineering. It covers five programming languages and various scenarios to comprehensively assess language models.
- The dataset contains 585,687 multi-turn question-answering entries, with an average of 6.62 dialogue turns per entry.
- Data was crawled from 30 well-known GitHub repositories and carefully filtered.
- Evaluation of ten popular LLMs revealed limitations in their software engineering question-answering capabilities.
- Medium-length contexts were found to be more conducive to LLM performance.
- The benchmark is publicly available on GitHub for further research and development in the field.
Source: CodeRepoQA: A Large-scale Benchmark for Software Engineering Question Answering
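To picture what "multi-turn question-answering entries" means in practice, here is a hypothetical entry and a trivial way to turn it into an evaluation example. The schema, repository name, and dialogue content are invented for illustration; the real CodeRepoQA format may differ.

```python
# Hypothetical shape of a multi-turn entry; the real CodeRepoQA schema may differ.
entry = {
    "repository": "pytorch/pytorch",   # repository name is illustrative
    "language": "Python",
    "turns": [
        {"role": "user", "content": "DataLoader hangs when num_workers > 0 on macOS."},
        {"role": "assistant", "content": "Which Python and PyTorch versions are you using?"},
        {"role": "user", "content": "Python 3.11, torch 2.1; it started after upgrading."},
        {"role": "assistant", "content": "Switch the multiprocessing start method to 'spawn'."},
    ],
}


def build_eval_example(entry: dict) -> tuple[list[dict], str]:
    """Use all but the final turn as context; the final turn is the reference answer."""
    *context, reference = entry["turns"]
    return context, reference["content"]


context, reference = build_eval_example(entry)
# A model under test is prompted with `context` and its reply compared to `reference`.
```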
A2H Converter: Automated Android to HarmonyOS UI Migration
A2H Converter is an automated tool that migrates Android user interfaces to HarmonyOS, addressing the challenge of developing applications for multiple platforms.
- The tool employs an LLM-driven multi-agent framework to convert Android XML layouts into HarmonyOS ArkUI layouts.
- It uses RAG (Retrieval-Augmented Generation) combined with decision rules to map Android UI components to ArkUI equivalents.
- A reflective mechanism continuously improves conversion accuracy.
- A2H Converter handles project-level layouts, ensuring consistency across multiple files and addressing complex UI logic.
- Experiments on six Android applications from GitHub demonstrated migration success rates of over 90.1%, 89.3%, and 89.2% at the component, page, and project levels, respectively.
- http://124.70.54.129:37860/ (inactive at the time of publishing)
Source: A2H: A UI Converter from Android to HarmonyOS Platform
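As a toy illustration of the rule-plus-retrieval idea, the sketch below maps a few Android widgets to ArkUI components with hard rules and falls back to retrieving documentation snippets for an LLM prompt. The mapping table and knowledge-base entries are illustrative; A2H Converter's multi-agent pipeline and reflection loop are far more involved.

```python
# Toy sketch of rule-plus-retrieval widget mapping, not A2H Converter's pipeline.
import difflib

# Hard decision rules for common widgets (orientation handling etc. omitted).
RULES = {
    "TextView": "Text",
    "Button": "Button",
    "ImageView": "Image",
    "LinearLayout": "Column",
    "RecyclerView": "List",
}

# Retrieval corpus: short documentation snippets about known conversions (illustrative).
KNOWLEDGE_BASE = [
    "ConstraintLayout -> RelativeContainer: position children with alignment rules",
    "FrameLayout -> Stack: children are stacked on the z-axis",
    "ScrollView -> Scroll: single scrollable child",
]


def map_component(widget: str) -> str:
    if widget in RULES:                                    # decision-rule hit
        return f"{widget} -> {RULES[widget]}"
    hits = difflib.get_close_matches(widget, KNOWLEDGE_BASE, n=2, cutoff=0.0)
    # In the real tool the retrieved snippets are packed into an LLM prompt,
    # and a reflection step validates the generated ArkUI layout.
    return f"{widget}: ask LLM with context {hits}"


print(map_component("TextView"))
print(map_component("ConstraintLayout"))
```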
Improving Readability of Automatically Generated Tests with LLMs
A method to enhance the readability of search-based test generators' output using large language models (LLMs), while maintaining their high code coverage.
- Search-based test generators produce effective unit tests with high coverage, but the generated tests lack meaningful names, making them difficult to understand.
- The approach focuses on improving test and variable names without altering the tests' semantics or coverage.
- Evaluation with nine industrial and open-source LLMs showed that the readability improvements preserve test semantics and are consistent across multiple iterations.
- A human study with ten professional developers found that the LLM-improved tests were as readable as developer-written tests, regardless of the LLM used.
Source: Improving the Readability of Automatically Generated Tests using Large Language Models
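The core constraint is that only identifiers change, never behavior. The paper targets search-based generators (EvoSuite-style, Java); the Python/pytest sketch below only illustrates the rename-then-recheck loop, with the prompt wording and model name as assumptions.

```python
# Sketch of semantics-preserving renaming: ask an LLM for better names, then
# re-run the test file to confirm behavior is unchanged. Prompt and model are
# placeholders, not the paper's setup.
import subprocess
from openai import OpenAI

client = OpenAI()


def improve_names(test_source: str, model: str = "gpt-4o-mini") -> str:
    prompt = (
        "Rename the test method and local variables in this unit test so the names "
        "describe the scenario. Do not change any statements, literals, or assertions. "
        "Return only the code.\n\n" + test_source
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content


def still_passes(test_file: str) -> bool:
    """Re-run the renamed test file; a green run is a cheap semantics check."""
    return subprocess.run(["pytest", "-q", test_file]).returncode == 0
```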
LLM-based Automated Code Review: Industrial Impact Study
A study examining the effects of LLM-based automated code review tools in an industrial setting, focusing on their impact on software quality, developer experience, and workflow efficiency.
- The research analyzed 4,335 pull requests across three projects, with 1,568 undergoing automated reviews using an AI-assisted tool based on the open-source Qodo PR Agent.
- Data collection involved quantitative analysis of pull request data, developer surveys on individual pull requests, and a broader survey of 22 practitioners' opinions on automated reviews.
- Results showed 73.8% of automated comments were resolved, but average pull request closure duration increased from 5 hours 52 minutes to 8 hours 20 minutes.
- Most practitioners reported minor improvements in code quality due to automated reviews, with benefits including enhanced bug detection, increased code quality awareness, and promotion of best practices.
- Drawbacks of the LLM-based tool included longer pull request closure times, faulty reviews, unnecessary corrections, and irrelevant comments.
Source: Automated Code Review In Practice
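For readers who want to reproduce the two headline numbers (comment resolution rate and closure duration) on their own data, a rough sketch follows. The column names are assumptions about an exported dataset, not the study's actual schema.

```python
# Rough sketch of the two headline PR metrics; column names are assumptions.
# Assumes: pip install pandas
import pandas as pd

prs = pd.read_csv("pull_requests.csv", parse_dates=["opened_at", "closed_at"])
comments = pd.read_csv("bot_comments.csv")  # one row per automated review comment

resolution_rate = comments["resolved"].mean() * 100
closure_hours = (prs["closed_at"] - prs["opened_at"]).dt.total_seconds() / 3600

print(f"automated comments resolved: {resolution_rate:.1f}%")
print("mean closure duration (h):", round(closure_hours.mean(), 2))
print(closure_hours.groupby(prs["auto_reviewed"]).mean())  # with vs. without the bot
```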
Agentable: Detecting Defects in LLM-based AI Agents
A study and tool for identifying and detecting defects in AI agents powered by large language models (LLMs), focusing on discrepancies between developer-implemented logic and LLM-generated content.
- The research analyzed 6,854 StackOverflow posts to define 8 types of agent defects, providing detailed descriptions and examples for each.
- Agentable, a static analysis tool, was developed to detect these defects using Code Property Graphs and LLMs to analyze agent workflows.
- Two datasets were created for evaluation: AgentSet (84 real-world agents) and AgentTest (78 agents with designed defects). Agentable achieved 88.79% accuracy and 91.03% recall in defect detection.
- Analysis of AgentSet revealed 889 defects, highlighting the prevalence of these issues in LLM-based AI agents.
Source: Defining and Detecting the Defects of the Large Language Model-based Autonomous Agents
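Agentable itself analyzes Code Property Graphs; as a much simpler stand-in, the sketch below walks a Python AST and flags one hypothetical defect pattern (parsing LLM output with `json.loads` outside any try/except), just to illustrate what a static check over agent code can look like. The defect pattern and code snippet are invented for illustration.

```python
# Simpler stand-in for CPG-based analysis: flag unguarded json.loads calls,
# which crash an agent on malformed LLM completions. Illustrative only.
import ast

SOURCE = '''
reply = llm.complete(prompt)
data = json.loads(reply)  # no guard against a malformed completion
'''


class UnguardedParse(ast.NodeVisitor):
    """Flag json.loads calls that sit outside any try/except block."""

    def __init__(self):
        self.try_depth = 0
        self.findings = []

    def visit_Try(self, node):
        self.try_depth += 1
        self.generic_visit(node)
        self.try_depth -= 1

    def visit_Call(self, node):
        f = node.func
        if (isinstance(f, ast.Attribute) and f.attr == "loads"
                and isinstance(f.value, ast.Name) and f.value.id == "json"
                and self.try_depth == 0):
            self.findings.append(f"line {node.lineno}: unguarded json.loads call")
        self.generic_visit(node)


checker = UnguardedParse()
checker.visit(ast.parse(SOURCE))
print(checker.findings)  # -> ['line 3: unguarded json.loads call']
```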
SimilarGPT: Smart Contract Vulnerability Detection Using GPT and Code Similarity
SimilarGPT is a tool for identifying vulnerabilities in smart contracts by combining GPT models with code-based similarity checking methods.
- The tool measures similarity between the inspected code and secure code from third-party libraries to detect potential vulnerabilities.
- Topological ordering optimizes the detection sequence, enhancing logical coherence and reducing false positives.
- A comprehensive reference codebase is established through analysis of code reuse patterns in smart contracts.
- LLMs conduct in-depth analysis of similar code to identify and explain potential vulnerabilities (see the sketch below).
- Experimental results show SimilarGPT excels in detecting vulnerabilities, particularly in reducing missed detections and false positives.
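The similarity step can be approximated with a plain sequence-ratio ranking over a reference corpus before handing the closest snippets to an LLM. The measure used here (difflib's ratio) and the reference entries are stand-ins, not necessarily what SimilarGPT uses.

```python
# Sketch of the similarity step: rank audited code against vetted reference
# snippets, then pass the closest matches to an LLM for explanation.
import difflib


def rank_similar(inspected: str, references: dict[str, str], top_k: int = 3):
    """Return (name, score) pairs for the reference snippets closest to `inspected`."""
    scored = [
        (name, difflib.SequenceMatcher(None, inspected, code).ratio())
        for name, code in references.items()
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]


references = {  # illustrative entries from well-known libraries
    "OpenZeppelin.SafeERC20.safeTransfer": "function safeTransfer(IERC20 token, ...) internal { ... }",
    "OpenZeppelin.ReentrancyGuard.nonReentrant": "modifier nonReentrant() { ... }",
}
inspected = 'function withdraw(uint amount) public { msg.sender.call{value: amount}(""); ... }'

for name, score in rank_similar(inspected, references):
    print(f"{score:.2f}  {name}")
    # The top matches plus the inspected snippet are packed into the LLM prompt.
```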
Prompt Evolution in LLM-Integrated Software: An Empirical Study
A study analyzing 1,262 prompt changes across 243 GitHub repositories to understand how developers manage and evolve prompts in LLM-integrated applications.
- Developers primarily evolve prompts through additions and modifications, with most changes occurring during feature development.
- Only 21.9% of prompt changes are documented in commit messages, highlighting a significant gap in documentation practices.
- Key challenges in prompt engineering include the introduction of logical inconsistencies and misalignment between prompt changes and LLM responses.
- The findings emphasize the need for specialized testing frameworks, automated validation tools, and improved documentation practices to enhance the reliability of LLM-integrated applications.
Source: Prompting in the Wild: An Empirical Study of Prompt Evolution in Software Repositories
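A rough way to see prompt evolution in your own repositories is to scan diff lines for prompt-like string assignments. The regex below is a heuristic assumption, not the study's extraction method.

```python
# Heuristic for spotting prompt changes in git history; not the study's method.
import re
import subprocess

PROMPT_LINE = re.compile(r'^[+-].*(prompt|system_message)\s*=\s*["\']', re.IGNORECASE)


def prompt_changes(repo_path: str, max_commits: int = 500) -> list[str]:
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "-p", f"-{max_commits}", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in log.splitlines() if PROMPT_LINE.match(line)]


if __name__ == "__main__":
    for change in prompt_changes(".")[:20]:
        print(change)
```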
LLM Performance Analysis for Code Summarization
A comparative study of open-source LLMs (LLaMA-3, Phi-3, Mistral, and Gemma) evaluates their performance in generating concise natural language descriptions for source code.
- Code summarization aims to create brief, accurate descriptions of source code, an increasingly important task in software engineering.
- The analysis focuses on assessing the performance of four open-source LLMs: LLaMA-3, Phi-3, Mistral, and Gemma.
- Performance evaluation uses key metrics including BLEU and ROUGE scores to measure the quality and accuracy of generated summaries.
- The study's findings are expected to provide insights into each model's strengths and weaknesses, contributing to the ongoing development of LLMs for code summarization tasks.
Source: Analysis on LLMs Performance for Code Summarization
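The metric computation itself is straightforward. The sketch below scores a toy reference/candidate pair with BLEU and ROUGE-L; it assumes the `nltk` and `rouge-score` packages, and the example summaries are made up.

```python
# Computing the two metrics used in the study on a toy example.
# Assumes: pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "returns the index of the first matching element or -1"
candidate = "return index of first element that matches, otherwise -1"

bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,  # avoids zero scores on short texts
)
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True).score(reference, candidate)

print(f"BLEU:    {bleu:.3f}")
print(f"ROUGE-L: {rouge['rougeL'].fmeasure:.3f}")
```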
Trust Calibration for AI Refactoring in IDEs
A position paper advocating for the integration of LLM-based refactoring in IDEs with a focus on trust development and safeguards.
- LLMs offer a new approach to large-scale code improvement through AI-assisted refactoring, addressing the industry's tendency to prioritize new features over code maintenance.
- The paper highlights inherent risks of LLM use, such as breaking changes and potential security vulnerabilities, emphasizing the need for trustworthy safeguards and IDE encapsulation.
- Future work will be based on established models from human factors in automation research, focusing on developing novel LLM safeguards and user interactions that foster appropriate trust levels.
- The research involves collaboration with CodeScene, enabling large-scale repository analysis and A/B testing to guide the design of research interventions.
Source: Trust Calibration in IDEs: Paving the Way for Widespread Adoption of AI Refactoring
ARCAS: Automated Root Cause Analysis for Complex Data Products
ARCAS is a diagnostic platform using a Domain Specific Language (DSL) for rapid implementation of automated troubleshooting guides in complex data products.
- The system comprises a network of automated troubleshooting guides (Auto-TSGs) that run concurrently, detecting issues and applying near-real-time mitigation using product telemetry.
- ARCAS's DSL allows subject matter experts to create relevant Auto-TSGs quickly, without needing to understand the entire diagnostic platform, reducing time-to-mitigate and conserving engineering resources.
- Unlike monitoring-focused platforms such as Datadog and New Relic, ARCAS employs an LLM to prioritize Auto-TSG outputs and initiate appropriate actions, eliminating the need for manual intervention or comprehensive system knowledge.
- The platform has been successfully implemented across multiple products in Azure Synapse Analytics and Microsoft Fabric Synapse Data Warehouse.
Source: Automated Root Cause Analysis System for Complex Data Products
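The ARCAS DSL itself is not shown here; the sketch below is a purely hypothetical rendering of the Auto-TSG idea, with each guide pairing a telemetry predicate with a mitigation and an LLM (stubbed as a ranking function) choosing what to run first.

```python
# Purely hypothetical rendering of the Auto-TSG idea -- NOT the ARCAS DSL.
from dataclasses import dataclass
from typing import Callable


@dataclass
class AutoTSG:
    name: str
    detect: Callable[[dict], bool]   # runs against product telemetry
    mitigate: Callable[[], str]


GUIDES = [
    AutoTSG("tempdb_full",
            detect=lambda t: t["tempdb_used_pct"] > 95,
            mitigate=lambda: "shrink tempdb / spill to remote storage"),
    AutoTSG("stale_stats",
            detect=lambda t: t["hours_since_stats_update"] > 48,
            mitigate=lambda: "trigger UPDATE STATISTICS job"),
]


def run(telemetry: dict, prioritize: Callable[[list[str]], list[str]]) -> list[str]:
    fired = [g for g in GUIDES if g.detect(telemetry)]
    ordered = prioritize([g.name for g in fired])   # an LLM call in the real system
    by_name = {g.name: g for g in fired}
    return [by_name[name].mitigate() for name in ordered]


# Trivial stand-in for the LLM prioritization step.
print(run({"tempdb_used_pct": 97, "hours_since_stats_update": 72}, prioritize=sorted))
```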
Challenges of Software Engineering with Large Language Models
A comprehensive analysis of the challenges facing software engineering as it integrates large language models (LLMs), based on discussions with over 20 experts from academia and industry.
- The integration of LLMs in software engineering (LLM4SE) is transforming the software development life cycle, offering potential for reduced human effort across various development activities.
- The study identifies 26 key challenges across seven aspects of software development, including requirements and design, coding assistance, testing, code review, maintenance, vulnerability management, and data handling.
- These challenges stem from the paradigm shift brought about by LLMs' unprecedented capacity to understand, generate, and operate programming languages.
- The research aims to benefit future LLM4SE studies by highlighting areas that require attention and innovation in this rapidly evolving field.
Source: The Current Challenges of Software Engineering in the Era of Large Language Models
LLM-based Test Generators: Challenges in Bug Detection
A critical examination of LLM-based test generation tools reveals potential limitations in their ability to detect bugs effectively.
- Recent tools like Codium CoverAgent and CoverUp, designed for automated test case generation, may unintentionally validate faulty code.
- The study focuses on a crucial aspect: these tools' test oracles are designed to pass, potentially conflicting with the primary goal of exposing bugs through failing test cases.
- Evaluation using real human-written buggy code demonstrates that LLM-generated tests can miss bugs and, more concerningly, validate them within the generated test suite.
- The research raises questions about the design principles behind LLM-based test generation tools and their impact on software quality assurance.
Source: Design choices made by LLM-based test generators prevent them from finding bugs
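The oracle problem the paper points at is easy to reproduce: when a generator derives expected values from the current implementation, a passing test can lock in the bug. Everything below is a constructed illustration, not an example from the paper.

```python
# Constructed illustration (not from the paper) of a generated oracle that
# locks in a bug: the expected value is copied from the faulty implementation.

def apply_discount(price: float, percent: float) -> float:
    """Intended: return the discounted price. Bug: returns the discount amount."""
    return price * percent / 100            # should be price * (1 - percent / 100)


# A coverage-driven generator that observes the implementation's output would
# emit an assertion like this -- it passes, covers the line, and silently
# validates the defect:
def test_apply_discount_generated():
    assert apply_discount(200.0, 10.0) == 20.0   # "expected" value taken from buggy output


# A specification-based oracle would instead fail and expose the bug:
def test_apply_discount_intended():
    assert apply_discount(200.0, 10.0) == 180.0
```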
Generative AI Toolkit: Automating LLM-Based Application Workflows
A framework for improving the quality and efficiency of LLM-based applications throughout their lifecycle, addressing the challenges of manual, slow, and trial-and-error-based development processes.
- The toolkit automates essential workflows for configuring, testing, monitoring, and optimizing Generative AI applications, including agents (a generic test loop is sketched after this list).
- Key benefits include significant quality improvements and shorter release cycles for LLM-based applications.
- Effectiveness of the toolkit is demonstrated through representative use cases, with best practices shared for implementation.
- The Generative AI Toolkit is open-sourced, encouraging adoption, adaptation, and improvement by other development teams.
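As a generic sketch of the kind of automated test loop such a toolkit wraps, the snippet below runs a fixed set of prompts through an LLM application and reports a pass rate. The `app` callable and the check functions are placeholders, not the toolkit's API.

```python
# Generic sketch of an automated eval loop; not the Generative AI Toolkit's API.
from typing import Callable

TestCase = dict  # {"input": str, "checks": list[Callable[[str], bool]]}

CASES: list[TestCase] = [
    {"input": "Summarize: the build failed because of a missing dependency.",
     "checks": [lambda out: "dependency" in out.lower(), lambda out: len(out) < 400]},
    {"input": "Extract the ticket ID from: 'Deploy blocked, see JIRA-1234.'",
     "checks": [lambda out: "JIRA-1234" in out]},
]


def evaluate(app: Callable[[str], str]) -> float:
    """Run every case through the LLM app and return the pass rate."""
    passed = 0
    for case in CASES:
        output = app(case["input"])
        passed += all(check(output) for check in case["checks"])
    return passed / len(CASES)

# evaluate(my_llm_app) can gate a release: re-run on every prompt or config change.
```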
Visual Code Assistants: Integrating Sketches into IDEs for ML Workflows
A study exploring the integration of Visual Code Assistants in IDEs, focusing on converting sketches into code for machine learning workflows.
- The research involved 19 data scientists, examining their sketching patterns when developing ML workflows. Diagrams were the preferred organizational component (52.6%), followed by lists (42.1%) and numbered points (36.8%).
- A prototype Visual Code Assistant was developed to convert sketches into Python notebooks using an LLM. The quality of generated code was evaluated using an LLM-as-judge setup.
- Results showed that even brief sketching could effectively generate useful code outlines, with a positive correlation between sketch time and code quality.
- Interviews with participants revealed promising applications for Visual Code Assistants in education, prototyping, and collaborative settings.
- The study suggests potential for the next generation of Code Assistants to integrate visual information, improving code generation and leveraging developers' existing sketching practices.
Source: An Exploratory Study of ML Sketches and Visual Code Assistants
LLMs in Mission-Critical IT Governance: A Survey of Practitioner Perspectives
A survey exploring the potential use of Large Language Models (LLMs) in governing mission-critical IT systems, focusing on security practitioners' views and concerns.
- The study aims to provide insights for researchers, practitioners, and policymakers on using generative AI in mission-critical systems (MCSs) governance.
- Survey data collected from developers and security personnel will help identify trends, challenges, and opportunities for introducing LLMs in MCS governance.
- Findings emphasize the need for interdisciplinary collaboration to ensure safe use of LLMs in MCS governance.
- Researchers should focus on developing regulation-oriented models and addressing accountability issues.
- Practitioners prioritize data protection and transparency, while policymakers are urged to establish a unified AI framework with global benchmarks for ethical and secure LLM-based MCS governance.
Source: On Large Language Models in Mission-Critical IT Governance: Are We Ready Yet?