17 min read

[AI Dev Tools] Test Generation, UI Migration, LLM-Powered Analysis, Autonomous Coding ...

Source: https://arxiv.org/pdf/2409.11190v1

TestGenEval: A Benchmark for Unit Test Generation and Completion

TestGenEval is a large-scale benchmark designed to measure test generation performance in software development, addressing the gap in evaluating LLMs for software testing tasks.

  • The benchmark comprises 68,647 tests from 1,210 code and test file pairs across 11 well-maintained Python repositories, and is built on SWE-bench.
  • TestGenEval covers three main aspects: initial test authoring, test suite completion, and code coverage improvements.
  • Several popular models, ranging from 7B to 405B parameters, were evaluated using TestGenEval.
  • Results show models struggle to generate high-coverage test suites, with the best model (GPT-4o) achieving an average coverage of only 35.2% (a minimal sketch of the coverage measurement follows this list).
  • Key challenges for models include reasoning about execution and handling assertion errors when dealing with complex code paths.
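The headline coverage numbers come from executing model-written tests against the code under test. As a rough illustration of that measurement (not the paper's harness; the helper name and file paths are hypothetical), the sketch below runs a generated test file under coverage.py and reports the line coverage it achieves on the target module:

    import json
    import pathlib
    import subprocess

    def coverage_of_generated_tests(source_file: str, test_file: str) -> float:
        """Run pytest under coverage.py and return the % line coverage of source_file."""
        # Generated tests may fail; we still want whatever coverage they produced.
        subprocess.run(["coverage", "run", "-m", "pytest", test_file, "-q"], check=False)
        subprocess.run(["coverage", "json", "-o", "cov.json"], check=True)
        report = json.loads(pathlib.Path("cov.json").read_text())
        file_report = report["files"].get(source_file)
        return file_report["summary"]["percent_covered"] if file_report else 0.0

    # Example: score a model-authored suite against the module it targets.
    # coverage_of_generated_tests("src/parser.py", "tests/test_parser_generated.py")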

Source: TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark

IconDesc: LLM-Based Alt-Text Generation for UI Icons

A method for generating informative alt-text for mobile UI icons using LLMs, designed to improve accessibility for visually impaired users during app development.

  • IconDesc addresses the challenge of creating meaningful alt-text for UI icons, which is essential for screen reader users but often overlooked.
  • The approach utilizes icon context, including class, resource ID, bounds, OCR-detected text, and information from parent and sibling nodes (see the prompt-assembly sketch after this list).
  • An off-the-shelf LLM is fine-tuned on a small dataset of approximately 1,400 icons to create IconDesc.
  • Empirical evaluation and user studies show significant improvements in generating relevant alt-text compared to traditional methods.
  • IconDesc serves as a valuable tool for developers, enabling rapid iteration and enhancement of UI accessibility during the app development process.
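The heart of the approach is feeding the LLM the icon's surrounding UI context rather than the image alone. A minimal sketch of that context-to-prompt step, assuming illustrative field names and a generic generate callable (not the paper's actual interface):

    from typing import Callable

    def icon_alt_text_prompt(icon: dict) -> str:
        parent = icon.get("parent", {})
        siblings = ", ".join(s.get("text", "") for s in icon.get("siblings", []) if s.get("text"))
        return (
            "Generate concise alt-text for a mobile UI icon.\n"
            f"Class: {icon['class']}\n"
            f"Resource ID: {icon['resource_id']}\n"
            f"Bounds: {icon['bounds']}\n"
            f"OCR text: {icon.get('ocr_text', 'none')}\n"
            f"Parent: {parent.get('class', 'unknown')} ({parent.get('text', '')})\n"
            f"Sibling text: {siblings or 'none'}\n"
            "Alt-text:"
        )

    def describe_icon(icon: dict, generate: Callable[[str], str]) -> str:
        # generate is any prompt -> completion function (e.g. the fine-tuned LLM).
        return generate(icon_alt_text_prompt(icon)).strip()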

Source: Inferring Alt-text For UI Icons With Large Language Models During App Development

SERGUI: Self-Elicitation of Requirements with Automated GUI Prototyping

SERGUI is an approach enabling self-elicitation of requirements through automated GUI prototyping, aimed at streamlining the early stages of software development.

  • The system leverages a large-scale GUI repository and natural language requirements for GUI retrieval, facilitating rapid feedback through prototypes (a minimal retrieval sketch follows this list).
  • An LLM drives the prompting-based recommendation of GUI features, stimulating the elicitation of additional requirements.
  • SERGUI addresses common challenges in traditional GUI prototyping, such as the need for experienced analysts and multiple customer sessions.
  • Intended for use in the initial requirements elicitation phase, SERGUI generates an initial GUI prototype specification for analyst-customer communication.
  • A preliminary evaluation has been conducted to assess the effectiveness of the approach, with a video demonstration available online.
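A minimal sketch of the requirement-to-GUI retrieval idea, using simple keyword overlap; the repository format and scoring are illustrative assumptions, and SERGUI's actual retrieval over a large-scale GUI repository is considerably more capable:

    def tokens(text: str) -> set[str]:
        return {t for t in text.lower().split() if len(t) > 2}

    def rank_guis(requirement: str, gui_repository: list[dict], top_k: int = 3) -> list[dict]:
        """Return the top_k GUIs whose descriptions best overlap the requirement."""
        req = tokens(requirement)
        scored = [
            (len(req & tokens(gui["description"])) / (len(req | tokens(gui["description"])) or 1), gui)
            for gui in gui_repository
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [gui for _, gui in scored[:top_k]]

    # Retrieved prototypes seed the analyst-customer discussion; an LLM can then
    # recommend additional GUI features to elicit further requirements.
    # rank_guis("login screen with email and password", gui_repository)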

Source: Self-Elicitation of Requirements with Automated GUI Prototyping

Code-Survey: LLM-Driven Analysis of Large-Scale Codebases

Code-Survey is a methodology that uses LLMs to systematically explore and analyze large-scale codebases, treating the LLM like a human survey participant in software development research.

  • The approach transforms unstructured data such as commits and emails into structured, analyzable datasets through carefully designed surveys (see the sketch after this list).
  • Application to the Linux kernel's eBPF subsystem resulted in the Linux-bpf dataset, containing over 670 features and 16,000 commits.
  • Quantitative analysis revealed insights into eBPF evolution, including development patterns, feature interdependencies, and areas needing attention for reliability and security.
  • Code-Survey is versatile and can be applied to other subsystems within Linux and other large-scale software projects.
  • The methodology facilitates deeper understanding of complex software systems, supporting improvements across various domains and enabling a wide range of empirical studies.
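A minimal sketch of the survey idea, assuming an illustrative question set, JSON schema, and ask_llm callable rather than the paper's exact design: each commit message is answered as if by a survey respondent, and the answers become rows of a structured dataset.

    import json
    from typing import Callable

    SURVEY = (
        "You are answering a survey about one Linux kernel commit.\n"
        "Return JSON with keys: feature_area, change_type (bugfix/feature/refactor/doc),\n"
        "touches_ebpf (true/false), one_line_summary.\n\nCommit message:\n{message}"
    )

    def survey_commit(message: str, ask_llm: Callable[[str], str]) -> dict:
        """Ask the LLM the fixed survey about one commit and parse its JSON answer."""
        return json.loads(ask_llm(SURVEY.format(message=message)))

    def build_dataset(commits: list[dict], ask_llm: Callable[[str], str]) -> list[dict]:
        # Each row pairs commit metadata with the LLM's structured survey answers.
        return [{"sha": c["sha"], **survey_commit(c["message"], ask_llm)} for c in commits]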

Source: Code-Survey: An LLM-Driven Methodology for Analyzing Large-Scale Codebases

ConGra: Benchmarking Automatic Conflict Resolution in Software Merging

ConGra is a conflict-graded benchmarking scheme for evaluating software merging tools' performance in resolving conflicts of varying complexity.

  • The scheme addresses the lack of effective conflict difficulty grading methods and large-scale open benchmarks for evaluating LLMs in automatic conflict resolution.
  • ConGra introduces a novel approach to classifying conflicts based on code operations, creating a comprehensive evaluation dataset of 44,948 conflicts from 34 real-world projects (a conflict-extraction sketch follows this list).
  • The benchmark was used to assess the performance of state-of-the-art LLMs and code LLMs in conflict resolution tasks, revealing unexpected insights.
  • ConGra aims to provide a deeper understanding of LLMs' limitations in resolving software merge conflicts and will be available on GitHub.
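A minimal sketch of the first step such a benchmark needs: extracting the conflicting blocks from a merge-conflicted file. The parser below is illustrative; ConGra's grading additionally classifies the code operations on each side of the conflict.

    def extract_conflicts(text: str) -> list[dict]:
        """Collect 'ours'/'theirs' blocks from a file containing git conflict markers."""
        conflicts, ours, theirs, state = [], [], [], None
        for line in text.splitlines():
            if line.startswith("<<<<<<<"):
                ours, theirs, state = [], [], "ours"
            elif line.startswith("=======") and state == "ours":
                state = "theirs"
            elif line.startswith(">>>>>>>") and state == "theirs":
                conflicts.append({"ours": "\n".join(ours), "theirs": "\n".join(theirs)})
                state = None
            elif state == "ours":
                ours.append(line)
            elif state == "theirs":
                theirs.append(line)
        return conflicts

    # Each extracted pair can then be graded by complexity (e.g., size or the kinds
    # of edits on each side) before being handed to an LLM for resolution.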

Source: CONGRA: Benchmarking Automatic Conflict Resolution

GitHub Profile Recruitment Bias in LLMs

A study exploring bias in LLMs when automating recruitment tasks for geographically diverse software teams using GitHub user profiles.

  • The research used OpenAI's ChatGPT to analyze 3,657 GitHub profiles from four regions over a five-year period (2019-2023).
  • ChatGPT demonstrated preferences for certain regions when selecting candidates for a six-person software development team.
  • Bias persisted even when location strings were swapped between profiles, indicating deeper underlying biases in the model.
  • The LLM showed a tendency to assign specific developer roles based on a user's country of origin.
  • These findings highlight the need for addressing and mitigating societal biases in LLMs, particularly in recruitment and team-building contexts.

Source: Nigerian Software Engineer or American Data Scientist? GitHub Profile Recruitment Bias in Large Language Models

ContractTinker: LLM-Powered Smart Contract Vulnerability Repair

ContractTinker is a tool that leverages LLMs to repair vulnerabilities in smart contracts, addressing the challenges developers face when fixing security issues identified by third-party audits.

  • The tool uses a Chain-of-Thought approach, breaking the repair process into manageable sub-tasks for improved accuracy (see the pipeline sketch after this list).
  • Program static analysis is integrated to guide the LLM and reduce hallucinations during the repair process.
  • In tests with 48 high-risk vulnerabilities, ContractTinker generated valid patches for 48% of cases, with an additional 21% requiring only minor modifications.
  • A demonstration video of ContractTinker is available on YouTube, showcasing its capabilities in real-world scenarios.
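A minimal sketch of the Chain-of-Thought decomposition, with illustrative sub-task prompts and a generic ask_llm callable; ContractTinker additionally feeds static-analysis results into these steps to curb hallucination:

    from typing import Callable

    def repair_vulnerability(contract_src: str, finding: str,
                             static_facts: str, ask_llm: Callable[[str], str]) -> str:
        # 1. Localize: which function/statement does the audit finding refer to?
        location = ask_llm(
            f"Audit finding:\n{finding}\n\nContract:\n{contract_src}\n"
            "Which function and lines are affected? Answer briefly."
        )
        # 2. Explain: what is the root cause, given the static-analysis facts?
        root_cause = ask_llm(
            f"Finding: {finding}\nLocation: {location}\nStatic analysis:\n{static_facts}\n"
            "Explain the root cause in one paragraph."
        )
        # 3. Patch: generate a minimal fix guided by the previous steps.
        return ask_llm(
            f"Contract:\n{contract_src}\nLocation: {location}\nRoot cause: {root_cause}\n"
            "Produce a minimal Solidity patch (only the changed function)."
        )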

Source: ContractTinker: LLM-Empowered Vulnerability Repair for Real-World Smart Contracts

LLM-Generated Reports: Enhancing DevSecOps Security Responsiveness

A study explores the use of LLM-generated security reports to combat alert fatigue in DevSecOps teams, particularly in resource-limited environments.

  • Alert fatigue in DevSecOps leads to decreased responsiveness to security warnings, potentially exposing systems to vulnerabilities.
  • LLM-generated reports emphasize financial impacts and consequences of unaddressed security issues, such as credential leaks.
  • Developer survey results indicate these reports significantly increase the likelihood of immediate action on security concerns.
  • Integration of LLM-generated reports into DevSecOps workflows can help mitigate attention saturation and ensure critical warnings are addressed effectively.
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: The potential of LLM-generated reports in DevSecOps

Human-LLM Interaction in Programming Tasks: A Literature Survey

A comprehensive review of Large Language Models (LLMs) and their impact on programming practices, focusing on user studies that assess LLM use in coding tasks.

  • The survey examines user interaction behaviors with LLMs, including types of requests made and task completion strategies.
  • Analysis reveals both benefits and weaknesses of LLMs, showing mixed effects on human programmers and task performance.
  • Factors influencing human enhancement and task performance are explored, considering aspects of the human, LLM, and their interaction.
  • Findings highlight the variability in human-LLM interactions due to the non-deterministic nature of both parties, emphasizing the need for deeper understanding of these patterns.
  • The paper concludes with practical suggestions for researchers and programmers working with LLMs in coding tasks.
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Understanding the Human-LLM Dynamic: A Literature Survey of LLM Use in Programming Tasks

AI-Driven Programming Assistants: A Comparative Study

A benchmark study comparing ChatGPT, Codeium, and GitHub Copilot's performance on LeetCode problems across various difficulty levels and categories.

  • Performance metrics included success rates, runtime efficiency, memory usage, and error-handling capabilities.
  • GitHub Copilot demonstrated superior performance on easier and medium tasks.
  • ChatGPT excelled in memory efficiency and debugging capabilities.
  • Codeium showed promise but struggled with more complex problems.
  • All tools faced challenges in handling harder problems, highlighting limitations in AI-driven programming assistance.
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Benchmarking ChatGPT, Codeium, and GitHub Copilot: A Comparative Study of AI-Driven Programming and Debugging Assistants

Object-Oriented Programming in AI and Data Science

A comprehensive guide on integrating Object-Oriented Programming (OOP) techniques in machine learning, deep learning, LLMs, and data analytics, focusing on improving code modularity, maintainability, and scalability.

  • The work outlines the evolution of computing and the rise of OOP, discussing key principles such as encapsulation, inheritance, polymorphism, and abstraction.
  • Python is used to demonstrate practical applications of OOP principles, given its widespread adoption in AI and data science.
  • Design patterns and modular programming are examined for enhancing the structure and efficiency of machine learning systems.
  • Real-world AI tasks, including preprocessing workflows, model training, and evaluation, are used to illustrate OOP concepts (a small Python example follows this list).
  • The guide aims to equip both beginners and experienced developers with knowledge to apply OOP methodologies in AI-driven projects, fostering more robust and maintainable systems.
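A small example in the spirit of the guide (the class names are ours, not the book's): encapsulation, abstraction, inheritance, and polymorphism applied to preprocessing steps so an ML pipeline can swap components freely.

    from abc import ABC, abstractmethod

    class Preprocessor(ABC):
        """Abstraction: every preprocessor exposes the same fit/transform contract."""
        @abstractmethod
        def fit(self, values: list[float]) -> "Preprocessor": ...
        @abstractmethod
        def transform(self, values: list[float]) -> list[float]: ...

    class StandardScaler(Preprocessor):
        def fit(self, values):
            self._mean = sum(values) / len(values)  # encapsulated state
            self._std = (sum((v - self._mean) ** 2 for v in values) / len(values)) ** 0.5 or 1.0
            return self
        def transform(self, values):
            return [(v - self._mean) / self._std for v in values]

    class MinMaxScaler(Preprocessor):
        def fit(self, values):
            self._lo, self._hi = min(values), max(values)
            return self
        def transform(self, values):
            span = (self._hi - self._lo) or 1.0
            return [(v - self._lo) / span for v in values]

    def run_pipeline(steps: list[Preprocessor], values: list[float]) -> list[float]:
        # Polymorphism: the pipeline depends only on the Preprocessor interface.
        for step in steps:
            values = step.fit(values).transform(values)
        return values

    print(run_pipeline([StandardScaler(), MinMaxScaler()], [1.0, 2.0, 3.0, 4.0]))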
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Deep Learning and Machine Learning, Advancing Big Data Analytics and Management: Object-Oriented Programming

LLM-Driven Explanations for Quantum Algorithms: A Comparative Analysis

A study analyzing how LLMs can support developers in understanding quantum code, comparing explanations from GPT-3.5, Llama2, and TinyLlama for seven quantum algorithms.

  • Llama2 provided the highest quality explanations from scratch, while GPT-3.5 excelled at improving existing explanations.
  • Adding a small amount of context to the prompt significantly enhanced the quality of explanations across all LLMs.
  • Explanations remained qualitatively and syntactically consistent over multiple rounds of generation.
  • The study used two different human-written prompt styles to analyze and compare the quality of explanations.
  • Future research directions include prompt optimization, parsing of quantum code explanations, and systematic assessment of explanation quality.
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Exploring LLM-Driven Explanations for Quantum Algorithms

TestBench: Evaluating LLMs' Class-Level Test Case Generation

TestBench is a benchmark for assessing the capabilities of LLMs in generating class-level test cases for Java programs. It includes a dataset of 108 Java programs from 9 real-world projects and a comprehensive evaluation framework.

  • The dataset spans 9 thematic domains, sourced from large-scale GitHub projects.
  • Three types of prompts are used: self-contained context, full context, and simple context.
  • Evaluation considers five aspects: syntactic correctness, compilation correctness, test correctness, code coverage rate, and defect detection rate.
  • A heuristic algorithm is proposed to repair erroneous test cases generated by LLMs.
  • Experiments with CodeLlama-13b, GPT-3.5, and GPT-4 show that larger models better utilize contextual information, while smaller models benefit from simplified contexts derived through abstract syntax tree analysis (illustrated in the sketch after this list).
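TestBench targets Java, but the "simplified context" idea is easy to illustrate with Python's ast module: reduce a source file to class and method signatures so a smaller model sees less, but more relevant, context. The sketch below is an analogy, not the benchmark's Java tooling.

    import ast

    def simplified_context(source: str) -> str:
        """Return only class names and function signatures from a source file."""
        tree, lines = ast.parse(source), []
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef):
                lines.append(f"class {node.name}:")
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in node.args.args)
                lines.append(f"    def {node.name}({args}): ...")
        return "\n".join(lines)

    print(simplified_context(
        "class Stack:\n    def push(self, item):\n        self.items.append(item)\n"
    ))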
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models

UTRefactor: Context-Enhanced LLM-Based Test Refactoring Framework

UTRefactor is a framework for automatic test refactoring in Java projects, leveraging LLMs and contextual information to eliminate test smells and improve code quality.

  • The framework extracts relevant context from test code and utilizes an external knowledge base containing test smell definitions, descriptions, and DSL-based refactoring rules.
  • UTRefactor employs a chain-of-thought approach to simulate manual refactoring, guiding the LLM through a step-by-step process for accurate and consistent smell elimination.
  • A checkpoint mechanism facilitates comprehensive refactoring, particularly effective when multiple smells are present in the code.
  • Evaluation on 879 tests from six open-source Java projects showed an 89% reduction in test smells, outperforming direct LLM-based refactoring methods by 61.82% and surpassing rule-based tools.
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Context-Enhanced LLM-Based Framework for Automatic Test Refactoring

GUIMIGRATOR: Android to iOS UI Migration Tool

GUIMIGRATOR is a novel approach for automating the migration of Android app user interfaces (UIs) to iOS, reducing development time and resource requirements.

  • The tool extracts and parses Android UI layouts, views, and resources to construct a UI skeleton tree.
  • It generates iOS UI code files from target code templates, which are then compiled and validated in Xcode (a simplified layout-parsing sketch follows this list).
  • Evaluation on 31 Android open-source applications across ten domains showed a UI similarity score of 78 between the original and migrated screenshots, outperforming two popular existing LLMs.
  • GUIMIGRATOR demonstrated high efficiency, completing migrations in an average of 7.6 seconds.
  • The approach avoids errors associated with screenshot recognition methods and reduces the cost of developing UIs from scratch.
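A highly simplified sketch of the pipeline: parse an Android layout into a UI skeleton tree, then render placeholder iOS declarations from templates. The view mapping and template below are illustrative assumptions; GUIMIGRATOR's rule set and generated code are far more complete.

    import xml.etree.ElementTree as ET

    ANDROID_TO_IOS = {
        "LinearLayout": "UIStackView",
        "TextView": "UILabel",
        "Button": "UIButton",
        "ImageView": "UIImageView",
    }

    def skeleton(node: ET.Element) -> dict:
        """Recursively build a UI skeleton tree from an Android layout element."""
        return {
            "android_view": node.tag,
            "ios_view": ANDROID_TO_IOS.get(node.tag, "UIView"),
            "children": [skeleton(child) for child in node],
        }

    def emit_ios(tree: dict, depth: int = 0, idx: int = 0) -> str:
        pad = "    " * depth
        lines = [f"{pad}let view{depth}_{idx} = {tree['ios_view']}()  // from {tree['android_view']}"]
        lines += [emit_ios(child, depth + 1, i) for i, child in enumerate(tree["children"])]
        return "\n".join(lines)

    layout = "<LinearLayout><TextView/><Button/></LinearLayout>"
    print(emit_ios(skeleton(ET.fromstring(layout))))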
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: A Rule-Based Approach for UI Migration from Android to iOS

Empirical Study of Issues in LLM Open-Source Projects

A comprehensive analysis of challenges, causes, and solutions in open-source projects utilizing Large Language Models (LLMs) as core components.

  • The study examined 994 closed issues from 15 LLM open-source projects to identify prevalent problems, their root causes, and potential resolutions.
  • Model Issues emerged as the most common challenge faced by practitioners in LLM open-source projects.
  • Three primary causes of issues were identified: Model Problems, Configuration and Connection Problems, and Feature and Method Problems.
  • Optimizing the model was found to be the predominant solution to address the identified issues.
  • The research provides valuable insights for practitioners and researchers working on LLM open-source projects, offering guidance for improving development and usage practices.
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Demystifying Issues, Causes and Solutions in LLM Open-Source Projects

AutoAPIEval: Framework for Evaluating API-oriented Code Generation in LLMs

AutoAPIEval is a lightweight, automated framework designed to assess LLMs' capabilities in API-oriented code generation, focusing on API recommendation and code example generation tasks.

  • The framework addresses the gap in evaluating LLMs for API-oriented code generation, working with any library that provides API documentation.
  • Four metrics evaluate the generated APIs and code examples, including the proportion of incorrect API recommendations and of uncompilable or unexecutable code examples (see the metric sketch after this list).
  • A case study conducted on ChatGPT, MagiCoder, and DeepSeek Coder using Java Runtime Environment 8 demonstrated the framework's effectiveness.
  • Findings revealed variability in LLM performance across tasks, with ChatGPT showing better adherence to instructions but similar effectiveness in code example generation compared to other models.
  • Key factors affecting code quality were identified, including API popularity and model confidence, with classifiers built to detect incorrect API recommendations and erroneous code examples.
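A minimal sketch of two of the metrics, with simplified inputs: the share of recommended APIs absent from the library's documented API set, and a rough compilability check (the study compiles Java against JRE 8; a Python syntax check stands in here).

    import py_compile
    import tempfile

    def incorrect_api_rate(recommended: list[str], documented: set[str]) -> float:
        """Fraction of recommended API names not found in the documentation."""
        if not recommended:
            return 0.0
        return sum(api not in documented for api in recommended) / len(recommended)

    def is_compilable(code: str) -> bool:
        """Rough proxy for the 'uncompilable example' metric (syntax check only)."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
        try:
            py_compile.compile(f.name, doraise=True)
            return True
        except py_compile.PyCompileError:
            return False

    print(incorrect_api_rate(["Math.abs", "Math.fizz"], {"Math.abs", "Math.max"}))  # 0.5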
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: A Comprehensive Framework for Evaluating API-oriented Code Generation in Large Language Models

PackageIntel: Automated Malicious Package Intelligence Extraction

A platform that automates the collection, processing, and retrieval of malicious package intelligence in public registries, addressing software supply chain security threats.

  • Utilizes exhaustive search techniques and snowball sampling from diverse sources to ensure enhanced coverage and timeliness.
  • Employs LLMs with specialized prompts for accurate intelligence extraction, achieving 98.6% precision and 92.0% F1 score.
  • Developed a comprehensive database of 20,692 malicious NPM and PyPI packages from 21 distinct intelligence repositories.
  • Detects threats on average 70% earlier than leading databases like Snyk and OSV, operating cost-effectively at $0.094 per intelligence piece.
  • Successfully identified and reported over 1,000 malicious packages in downstream package manager mirror registries.
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: PackageIntel: Leveraging Large Language Models for Automated Intelligence Extraction in Package Ecosystems

Code Comment Quality Assessment: LLMs vs Expert-Generated for Novice Programmers

A study evaluating the instructional quality of code comments generated by LLMs for novice programmers, comparing them to expert-developed comments using a dataset of "easy" level Java solutions from LeetCode.

  • GPT-4 demonstrates comparable quality to expert comments in key aspects for beginners, including clarity, beginner-friendliness, concept elucidation, and step-by-step guidance.
  • GPT-4 outperforms Llama2 in discussing code complexity (chi-square = 11.40, p = 0.001).
  • Perceived as significantly more supportive for beginners, GPT-4 surpasses both GPT-3.5 and Llama2 (Mann-Whitney U-statistics = 300.5 and 322.5, p = 0.0017 and 0.0003, respectively); a sketch of running these tests follows this list.
  • The study highlights the potential of LLMs in generating code comments tailored to novice programmers' needs.
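For readers who want to reproduce this style of comparison on their own ratings, the sketch below runs the same two tests with scipy; the counts and scores are made up, and only the choice of tests mirrors the study.

    from scipy.stats import chi2_contingency, mannwhitneyu

    # Hypothetical counts of comments that do / do not discuss code complexity.
    contingency = [[40, 10],   # GPT-4: discusses complexity, does not
                   [22, 28]]   # Llama2
    chi2, p_chi, _, _ = chi2_contingency(contingency)

    # Hypothetical 1-5 "supportive for beginners" ratings for two models.
    gpt4_scores = [5, 4, 5, 4, 4, 5, 3, 5]
    llama2_scores = [3, 2, 4, 3, 2, 3, 3, 2]
    u_stat, p_u = mannwhitneyu(gpt4_scores, llama2_scores, alternative="two-sided")

    print(f"chi2={chi2:.2f} p={p_chi:.4f}; U={u_stat:.1f} p={p_u:.4f}")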
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Evaluating the Quality of Code Comments Generated by Large Language Models for Novice Programmers

SAIL: Skill-Adaptive Imitation Learning for UI Test Migration

SAIL is a framework that enhances UI test migration effectiveness through skill-adaptive imitation learning, addressing the limitations of traditional event-mapping approaches.

  • UI test migration aims to automatically generate test cases for target mobile apps by adapting tests from similar source apps, reducing manual crafting costs.
  • Traditional approaches focus on sequential UI-event-mapping, but even highly accurate LLM-driven solutions struggle with implementation discrepancies between source and target apps.
  • SAIL employs multi-level abstraction of test cases' underlying skills, using source tests as demonstrations to build a knowledge base for target app test generation.
  • The framework's novel context- and history-aware skill adaptation selectively reuses learned skills to guide test case generation for the target app.
  • Evaluation results show SAIL achieves a 149% higher success rate than state-of-the-art approaches in UI test migration effectiveness.
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Skill-Adaptive Imitation Learning for UI Test Reuse

LLMs for Manual Test Verification Generation: An Exploratory Study

A study exploring the effectiveness of LLMs in generating verifications for manual software tests, comparing open-source and closed-source models and assessing professional testers' perceptions.

  • Open-source models Mistral-7B and Phi-3-mini-4k showed comparable effectiveness to closed-source models like Gemini-1.5-flash and GPT-3.5-turbo in generating manual test verifications.
  • Professional testers' agreement level with LLM-generated verifications was slightly above 40%, indicating potential but significant room for improvement.
  • The study generated a dataset of 37,040 test verifications using 8 different LLMs, providing a valuable resource for further research in this area.
  • Concerns were raised about AI hallucinations, where some generated verifications significantly deviated from expectations.
  • Results suggest the need for refinement in accuracy, relevance, and clarity of LLM-generated verifications to enhance reliability in real-world testing scenarios.
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: On the Effectiveness of LLMs for Manual Test Verifications

Bots in Software Engineering: A Multivocal Literature Review

A comprehensive study on the use of bots and conversational agents in software engineering, exploring their motivations, challenges, best practices, and benefits.

  • Bots are software systems that automate specific processes, tasks, or activities. Conversational agents are bots with a conversational component for user interaction.
  • The adoption of AI-powered bots in software development has increased over time. However, practitioners report that bots can introduce additional challenges, potentially worsening rather than improving the development process.
  • The study provides a taxonomy for characterizing bots and outlines challenges associated with their adoption in software engineering, along with potential mitigation strategies.
  • The research methodology involves a multivocal literature review, examining both academic research and practitioner literature to bridge the gap between theory and practice.
  • The review aims to contribute to both research and practice by identifying future research directions, providing strategies for improving bot usage in software engineering, and facilitating knowledge transfer between academia and industry.
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Motivations, Challenges, Best Practices, and Benefits for Bots and Conversational Agents in Software Engineering: A Multivocal Literature Review

LLM-Agent-UMF: Unified Modeling Framework for LLM-based Agents

A framework that provides a unified software architecture for LLM-based agents, addressing modularity issues and architectural ambiguities in existing systems.

  • The framework introduces the concept of a "core-agent" as the central coordinator, comprising five modules: planning, memory, profile, action, and security (a structural sketch follows this list).
  • Core-agents are classified into passive and active types, allowing for various multi-core agent architectures that combine unique characteristics of individual agents.
  • LLM-Agent-UMF clearly distinguishes between different components of an agent, setting LLMs and tools apart from the core-agent.
  • The framework was applied to state-of-the-art agents, demonstrating its alignment with existing functionalities while clarifying overlooked architectural aspects.
  • Evaluation of four proposed architectures integrating distinctive agents into hybrid active/passive core-agents' systems provided insights into potential improvements and combination challenges.
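A structural sketch of the core-agent idea in code; the module interfaces are our assumption (the paper defines the framework architecturally, not as an implementation), but it shows the five modules coordinated by one core-agent, with the LLM and tools kept outside the core.

    from dataclasses import dataclass, field
    from typing import Callable

    @dataclass
    class CoreAgent:
        llm: Callable[[str], str]                  # external: the language model
        tools: dict[str, Callable[[str], str]]     # external: callable tools
        memory: list[str] = field(default_factory=list)
        profile: str = "general-purpose software agent"
        active: bool = False                       # passive vs. active core-agent

        def plan(self, goal: str) -> str:          # planning module
            return self.llm(f"Profile: {self.profile}\nGoal: {goal}\nNext step:")

        def act(self, step: str) -> str:           # action module
            first = step.split()[0].lower() if step.split() else ""
            tool = self.tools.get(first)
            result = tool(step) if tool else self.llm(step)
            self.memory.append(result)             # memory module
            return result

        def check(self, output: str) -> bool:      # security module (stub)
            return "DROP TABLE" not in output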
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: LLM-Agent-UMF: LLM-based Agent Unified Modeling Framework for Seamless Integration of Multi Active/Passive Core-Agents

SuperCoder2.0: Autonomous AI-Powered Software Development System

SuperCoder2.0 is an advanced autonomous system designed to enhance software development through AI, combining intelligent agents and AI-native development approaches for fully autonomous coding.

  • The system employs a three-step hierarchical search space reduction approach for code base navigation and bug localization, utilizing Retrieval Augmented Generation (RAG) and various mapping techniques.
  • Code editing is performed through a two-part module: CodeGeneration and CodeEditing. This module generates multiple solutions at different temperature values and replaces entire methods or classes to maintain code integrity.
  • Key features include a retry mechanism with error-output traceback, comprehensive code rewriting using Abstract Syntax Tree (AST) parsing, and code embedding for retrieval-augmented generation (an AST-based rewriting sketch follows this list).
  • Experiments on the SWE-bench Lite dataset show SuperCoder2.0 achieving correct file localization in 84.33% of cases within the top 5 candidates and successfully resolving 34% of test instances.
  • The system's performance ranks fourth globally on the SWE-bench leaderboard, demonstrating its potential as a versatile tool for autonomous software development across diverse repositories and problem types.
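A minimal sketch of the whole-method replacement idea using Python's ast module; the helper is ours, and SuperCoder2.0's pipeline also covers localization, retrieval-augmented generation, and the retry loop.

    import ast

    def replace_function(module_src: str, func_name: str, new_func_src: str) -> str:
        """Swap the named top-level function for the one defined in new_func_src."""
        module = ast.parse(module_src)
        replacement = ast.parse(new_func_src).body[0]
        if not isinstance(replacement, (ast.FunctionDef, ast.AsyncFunctionDef)):
            raise ValueError("new_func_src must define a single function")
        module.body = [
            replacement
            if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef)) and n.name == func_name
            else n
            for n in module.body
        ]
        return ast.unparse(module)  # Python 3.9+

    old = "def add(a, b):\n    return a - b\n\ndef sub(a, b):\n    return a - b\n"
    print(replace_function(old, "add", "def add(a, b):\n    return a + b\n"))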
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: SuperCoder2.0: Technical Report on Exploring the feasibility of LLMs as Autonomous Programmer

Code Comment Inconsistency and Bug Introduction: A GPT-3.5 Analysis

A study investigating the impact of code-comment inconsistencies on bug introduction using GPT-3.5, revealing that inconsistent changes are about 1.5 times more likely to lead to bug-introducing commits.

  • GPT-3.5 outperforms other state-of-the-art methods in detecting code-comment inconsistencies.
  • Analysis of temporal evolution shows the impact of inconsistencies on bug proneness is highest immediately after introduction and decreases over time.
  • The research emphasizes the importance of maintaining up-to-date and consistent code comments in software development.
  • Findings provide new insights into the relationship between code-comment inconsistency and software quality, offering a comprehensive temporal analysis.
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Investigating the Impact of Code Comment Inconsistency on Bug Introducing

VulnLLMEval: Framework for Evaluating LLMs in Software Vulnerability Detection and Patching

A framework designed to assess the performance of LLMs in identifying and patching vulnerabilities in C code, using a dataset of 307 real-world vulnerabilities from the Linux kernel.

  • Establishes a benchmark for evaluating LLMs' strengths and limitations in software vulnerability detection (SVD) and patching (SVP) tasks.
  • The dataset includes both vulnerable and patched code, providing a diverse and representative testbed for rigorous assessment.
  • Results indicate that LLMs often struggle to distinguish between vulnerable and patched code.
  • In SVP tasks, models tend to oversimplify code, producing solutions that may require further refinement before practical use.
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: VulnLLMEval: A Framework for Evaluating Large Language Models in Software Vulnerability Detection and Patching

Context-aware C-to-Rust Translation Using LLMs

A translation scheme that improves the success rate of converting large-scale C code into compilable Rust code using large language models (LLMs).

  • The approach addresses the challenge of translating C to Rust, motivated by the need for memory-safe alternatives to existing C programs.
  • Three key techniques are employed: pre-processing C code to align with Rust structure, segmenting code into optimal translation units, and iterative compilation and error repair (see the compile-and-repair sketch after this list).
  • Context-supplementing prompts maintain consistency between translation units, overcoming LLM context window limitations.
  • Experiments with 20 benchmark C programs, including those exceeding 4,000 lines of code, resulted in successful translation to compilable Rust code without loss of original functionality.
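A minimal sketch of the iterative compile-and-repair loop for one translation unit; the prompts and ask_llm callable are illustrative, and the paper additionally pre-processes the C code and supplies cross-unit context in each prompt.

    import pathlib
    import subprocess
    import tempfile
    from typing import Callable

    def compile_rust(code: str) -> str:
        """Return rustc's stderr ('' means the unit compiled)."""
        path = pathlib.Path(tempfile.mkdtemp()) / "unit.rs"
        path.write_text(code)
        result = subprocess.run(
            ["rustc", "--edition", "2021", "--crate-type", "lib", str(path),
             "--out-dir", str(path.parent)],
            capture_output=True, text=True,
        )
        return "" if result.returncode == 0 else result.stderr

    def translate_unit(c_code: str, ask_llm: Callable[[str], str], max_rounds: int = 5) -> str:
        rust = ask_llm(f"Translate this C code to safe, idiomatic Rust:\n{c_code}")
        for _ in range(max_rounds):
            errors = compile_rust(rust)
            if not errors:
                return rust
            rust = ask_llm(f"This Rust does not compile.\nErrors:\n{errors}\nCode:\n{rust}\nFix it.")
        return rust  # best effort after max_rounds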
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Context-aware Code Segmentation for C-to-Rust Translation using Large Language Models

Large-Scale Privacy Assessment of Android Third-Party SDKs

A comprehensive study analyzing privacy protection practices in Android third-party Software Development Kits (SDKs), focusing on data exfiltration and behavior-policy compliance.

  • The study examined 158 widely-used SDKs from two key release platforms, identifying 338 instances of privacy data exfiltration.
  • Over 30% of the examined SDKs failed to provide a privacy policy disclosing their data handling practices.
  • Among SDKs with privacy policies, 37% over-collected user data, and 88% falsely claimed access to sensitive data.
  • A 12-month follow-up analysis showed no significant improvement in these concerning trends.
  • The research proposes three actionable recommendations to mitigate privacy leakage risks and enhance protection for Android users.
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: A Large-Scale Privacy Assessment of Android Third-Party SDKs

LLMs for Software Project Cost and Duration Prediction

A study exploring the use of LLMs to improve the accuracy and usability of software project cost and duration estimates.

  • The research compares LLMs to traditional estimation methods and contemporary machine learning techniques.
  • Key focus areas include LLMs' performance against existing models, ease of integration into current practices, and ability to outperform traditional estimation methods.
  • The study applies LLMs to real-world datasets, aiming to demonstrate their superior accuracy and user-friendliness compared to complex predictive models.
  • Potential implications include transforming project management strategies in the software industry by offering a more accessible and accurate estimation tool.
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Leveraging Large Language Models for Predicting Cost and Duration in Software Engineering Projects

LLM-based Agents in Software Engineering: A Survey and Framework

A comprehensive survey of Large Language Model (LLM)-based agents in software engineering, presenting a framework and discussing challenges and opportunities in this emerging field.

  • The study examines the growing use of LLMs in software engineering tasks, noting the increasing adoption of agent-like approaches.
  • A framework for LLM-based agents in software engineering is proposed, consisting of three key modules: perception, memory, and action.
  • Current challenges in combining LLM-based agents with software engineering are summarized, along with potential future opportunities.
  • A GitHub repository (https://github.com/DeepSoftwareAnalytics/Awesome-Agent4SE) maintains a collection of related papers for further reference.
Tools you can use from the paper:
Awesome-Agent4SE paper collection: https://github.com/DeepSoftwareAnalytics/Awesome-Agent4SE

Source: Agents in Software Engineering: Survey, Landscape, and Vision