[AI Dev Tools] LLM-Driven E2E Testing, Accessibility Fix Suggestions, Real-World Code Completion Evaluation ...
![[AI Dev Tools] LLM-Driven E2E Testing, Accessibility Fix Suggestions, Real-World Code Completion Evaluation ...](/content/images/size/w960/2024/08/Screenshot_5.jpg)
AUTOE2E: Feature-Driven E2E Test Generation for Web Applications
AUTOE2E is an approach that uses LLMs to generate semantically meaningful, feature-driven end-to-end test cases for web applications.
- The system infers potential features within a web application and translates them into executable test scenarios.
- E2EBENCH, a new benchmark, is introduced to assess the feature coverage of E2E test suites automatically.
- Evaluation on E2EBENCH shows AUTOE2E achieves an average feature coverage of 79%, outperforming the best baseline by 558%.
- This approach addresses the limitations of manual test creation and random test generation techniques.
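For intuition, here is a minimal sketch of the kind of feature-driven test AUTOE2E aims to produce, written with Playwright's Python API; the URL, selectors, and the inferred "add to cart" feature are illustrative placeholders, not examples from the paper.

```python
# Hypothetical feature-driven E2E test of the kind AUTOE2E aims to generate:
# the inferred feature is "add an item to the cart", expressed as an executable scenario.
from playwright.sync_api import sync_playwright

def test_add_item_to_cart():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://shop.example.com")   # placeholder application URL
        page.click("text=Blue T-Shirt")         # open a product page
        page.click("button#add-to-cart")        # exercise the inferred feature
        page.click("a#cart")
        # Assert the feature's observable effect rather than arbitrary DOM state.
        assert page.inner_text("#cart-count") == "1"
        browser.close()
```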
Source: A Feature-Based Approach to Generating Comprehensive End-to-End Tests
FixAlly: Automated Code Fix Suggestions for Mobile App Accessibility Issues
FixAlly is an automated tool that suggests source code fixes for accessibility issues in mobile apps, addressing the challenges developers face in identifying and resolving these problems.
- The tool employs a multi-agent LLM architecture to generate fix strategies, localize issues within the source code, and propose code modifications.
- An empirical study showed FixAlly's effectiveness, with 77% of its suggestions being plausible fixes for issues identified by accessibility scanners.
- A survey of 12 iOS developers revealed they would be willing to accept 69.4% of the evaluated fix suggestions, indicating the tool's practical value.
- FixAlly aims to bridge the gap between issue detection and resolution, addressing the limitations of current accessibility testing tools that may not provide adequate guidance on fixes.
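A rough sketch of that multi-agent flow, assuming a generic `ask_llm(prompt)` helper as a stand-in for whatever model API FixAlly actually uses; the prompts and role split below are illustrative only.

```python
# Hypothetical sketch of a FixAlly-style multi-agent pipeline.
# ask_llm() is a stand-in for an actual LLM API call; prompts are illustrative only.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def suggest_accessibility_fix(issue_report: str, source_files: dict[str, str]) -> str:
    # Agent 1: propose a fix strategy for the scanner-reported issue.
    strategy = ask_llm(f"Accessibility issue:\n{issue_report}\nPropose a fix strategy.")

    # Agent 2: localize the issue, i.e. pick the file the strategy applies to.
    listing = "\n".join(source_files)
    target_file = ask_llm(
        f"Strategy:\n{strategy}\nWhich of these files should change?\n{listing}"
    ).strip()

    # Agent 3: propose a concrete code modification for the localized snippet.
    return ask_llm(
        f"Strategy:\n{strategy}\nFile {target_file}:\n{source_files[target_file]}\n"
        "Return the modified code."
    )
```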
Source: Automated Code Fix Suggestions for Accessibility Issues in Mobile Apps
RepoMasterEval: A Benchmark for Real-World Code Completion Evaluation
RepoMasterEval is a novel benchmark for evaluating code completion models, constructed from real-world Python and TypeScript repositories.
- The benchmark addresses limitations in existing code completion evaluations, which often focus on function and class-level tasks with rich text descriptions.
- Each benchmark datum is created by masking a code snippet in a source file from a repository that already has a test suite, aligning the task more closely with practical code completion scenarios.
- Mutation testing is employed to measure test case effectiveness, with manual augmentation of test suites that have low mutation scores.
- Empirical evaluation of 6 state-of-the-art models demonstrates the importance of test augmentation in improving benchmark accuracy.
- A one-month deployment in a collaborating company showed RepoMasterEval's utility in providing accurate feedback during model training, with scores correlating highly to practical performance.
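As a sketch of the mutation-testing step, the score below is simply the fraction of generated mutants that the repository's existing tests detect; suites with low scores were then augmented by hand. The helper names here are hypothetical.

```python
# Illustrative mutation-score computation; run_tests() is a hypothetical helper that
# executes a repository's test suite against one code variant and returns True if it passes.

def mutation_score(original_passes: bool, mutants: list[str], run_tests) -> float:
    """Fraction of mutants 'killed' (i.e. detected) by the existing test suite."""
    if not original_passes or not mutants:
        return 0.0
    killed = sum(1 for mutant_src in mutants if not run_tests(mutant_src))
    return killed / len(mutants)

# A low score signals a weak suite: it cannot distinguish correct completions
# from subtly wrong ones, so extra test cases are added manually.
```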
Source: RepoMasterEval: Evaluating Code Completion via Real-World Repositories
CAPE: Chat-like Asserts Prediction for Python Using LLMs
CAPE is an approach that generates meaningful assert statements for Python projects using LLMs and interpreter interaction.
- The system employs persona, Chain-of-Thought, and one-shot learning techniques in prompt design to enhance assert statement generation.
- CAPE conducts multiple rounds of communication between the LLM and Python interpreter to produce effective assert statements.
- Evaluation shows 64.7% accuracy for single assert statement generation and 62% for overall assert statement generation, surpassing existing methods.
- A Python assert statement dataset from GitHub was created to support the research.
- The approach has potential applications in automated Python unit test generation and broader software engineering practices.
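A minimal sketch of that chat-like loop, assuming a hypothetical `ask_llm` helper: ask the model for an assert, let the Python interpreter execute it, and feed any error back for another round. The prompt wording is illustrative, not CAPE's actual prompt.

```python
# Sketch of an LLM <-> interpreter loop for assert generation (ask_llm is a hypothetical helper).

def generate_assert(focal_code: str, test_prefix: str, ask_llm, max_rounds: int = 3):
    prompt = (
        "You are an expert Python tester.\n"                       # persona
        f"Code under test:\n{focal_code}\n"
        f"Test so far:\n{test_prefix}\n"
        "Think step by step, then output one assert statement."    # chain of thought
        # a one-shot example assert could be appended here
    )
    for _ in range(max_rounds):
        candidate = ask_llm(prompt)
        try:
            exec(test_prefix + "\n" + candidate, {})   # let the interpreter judge it
            return candidate                           # assert executed without error
        except Exception as err:                       # feed the failure back to the model
            prompt += f"\nYour assert `{candidate}` failed with: {err}. Try again."
    return None
```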
Source: Chat-like Asserts Prediction with the Support of Large Language Model
HardEval: A Framework for Assessing LLM Task Difficulty in Programming
HardEval is a framework for evaluating the difficulty of programming tasks for LLMs and creating new challenging tasks based on identified hard problems.
- Current LLM evaluations often report aggregate metrics over entire benchmarks, which may not reflect the difficulty of individual tasks or the true capabilities of the models.
- The framework uses diverse prompts across multiple LLMs to generate a difficulty score for each task in a benchmark. This approach helps identify truly challenging problems for LLMs.
- Analysis of two code generation benchmarks, HumanEval+ and ClassEval, revealed that only 21% and 27% of their tasks, respectively, are difficult for LLMs.
- HardEval identified six practical hard task topics, which were used to generate new challenging tasks for LLM evaluation and improvement.
- The framework's general approach can be applied to domains beyond code generation, such as code completion or question-answering tasks.
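The paper defines its own scoring formula; as a loose approximation of the idea, a task's difficulty can be read as its failure rate across many (model, prompt) runs, so a task only counts as hard if it resists diverse prompts on diverse LLMs.

```python
# Illustrative difficulty score: fraction of (model, prompt) runs whose generated
# solution fails the task's tests. A simplification of HardEval's actual metric.

def difficulty(task, models, prompts, generate, passes_tests) -> float:
    runs = [(m, p) for m in models for p in prompts]
    failures = sum(1 for m, p in runs if not passes_tests(task, generate(m, p, task)))
    return failures / len(runs)   # 1.0 = no model/prompt combination solved the task
```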
Source: Assessing Programming Task Difficulty for Efficient Evaluation of Large Language Models
Code Documentation Generation: Developer Prompting Effectiveness Study
A controlled experiment examining the ability of developers to effectively prompt LLMs for generating code documentation, comparing ad-hoc prompts with predefined few-shot prompts.
- The study involved 20 professionals and 30 computer science students tasked with generating documentation for two Python functions.
- Participants in the experimental group used ad-hoc prompts in a ChatGPT-like Visual Studio Code extension, while the control group used a predefined few-shot prompt.
- Results showed that both professionals and students were generally unaware of or unable to apply prompt engineering techniques effectively.
- Students perceived documentation from ad-hoc prompts as significantly less readable, concise, and helpful compared to that from prepared prompts.
- Some professionals improved documentation quality by simply including the keyword "Docstring" in their ad-hoc prompts.
- Participants rarely assessed the LLM-generated output as perfect, viewing the tools as aids for iterative documentation refinement.
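For readers unfamiliar with the setup, a predefined few-shot docstring prompt might look like the template below; this is an illustrative sketch, not the exact prompt used in the study.

```python
# Illustrative few-shot prompt template for docstring generation (not the study's prompt).
FEW_SHOT_PROMPT = '''Write a concise docstring for the last function.

def add(a, b):
    return a + b
Docstring: """Return the sum of a and b."""

def fahrenheit(celsius):
    return celsius * 9 / 5 + 32
Docstring: """Convert a temperature from Celsius to Fahrenheit."""

{function_source}
Docstring:'''

prompt = FEW_SHOT_PROMPT.format(function_source="def square(x):\n    return x * x")
```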
Source: Can Developers Prompt? A Controlled Experiment for Code Documentation Generation
INSEC: Attack Method for Inducing Vulnerable Code from Completion Engines
INSEC is an attack method that manipulates black-box code completion engines to generate vulnerable code while maintaining functional correctness.
- The attack targets engines powered by large language models, including commercial services like GitHub Copilot and OpenAI API.
- INSEC works by inserting a malicious attack string as a short comment in the completion input, derived through specialized initialization schemes and optimization procedures.
- Tested on 16 Common Weakness Enumerations (CWEs) across 5 programming languages, INSEC increased the likelihood of generating unsafe code by over 50% in absolute terms.
- The attack is resource-efficient, costing less than $10 to develop on commodity hardware.
Source: Practical Attacks against Black-box Code Completion Engines
LiCoEval: License Compliance Evaluation Benchmark for LLMs in Code Generation
A benchmark for evaluating the ability of LLMs to provide accurate license information for generated code, addressing potential intellectual property violations in AI-assisted software development.
- The study establishes a standard for "striking similarity" between LLM-generated code and existing open-source implementations, indicating a copy relationship.
- LiCoEval, the proposed evaluation benchmark, was used to assess 14 popular LLMs for license compliance capabilities.
- Results show that even top-performing LLMs produce a small but non-negligible proportion (0.88% to 2.01%) of code strikingly similar to existing open-source implementations.
- Most LLMs fail to provide accurate license information, particularly for code under copyleft licenses, highlighting the need for improved compliance capabilities in AI-assisted coding tools.
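The paper defines its own "striking similarity" criterion; the sketch below only illustrates the flavor of such a check, flagging a generated snippet whose textual similarity to a known open-source file exceeds a threshold (both the metric and the threshold are placeholders, not LiCoEval's).

```python
# Illustrative similarity check, not LiCoEval's actual criterion or threshold.
from difflib import SequenceMatcher

def strikingly_similar(generated: str, open_source: str, threshold: float = 0.9) -> bool:
    ratio = SequenceMatcher(None, generated, open_source).ratio()  # 0.0 .. 1.0
    # Above the threshold, treat the snippet as a likely copy and check its license.
    return ratio >= threshold
```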
Source: A First Look at License Compliance Capability of LLMs in Code Generation
CPSBench: Evaluating LLMs for Cyber-Physical Systems Requirements Modeling
CPSBench is a benchmark for evaluating the ability of large language models (LLMs) to model requirements for cyber-physical systems (CPSs) using problem diagrams.
- The benchmark focuses on two key tasks: entity recognition and interaction extraction from domain-specific CPS requirement documents.
- Seven advanced LLMs were tested using CPSBench, providing insights into their capabilities and limitations in understanding and modeling CPS requirements.
- The study also establishes a taxonomy of LLM hallucinations specific to CPS requirements modeling, which could guide future research in automating this process.
- This research addresses the challenge of manually extracting and modeling CPS requirements, which is typically time-consuming, labor-intensive, and error-prone.
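To make the two tasks concrete, here is a toy example of the expected output for one requirement sentence; the sentence and output schema are hypothetical, not drawn from the benchmark.

```python
# Toy illustration of CPSBench's two tasks on one requirement sentence
# (the sentence and the output schema are hypothetical).
requirement = ("When the water level sensor reports a high level, "
               "the controller shall close the inlet valve.")

expected = {
    "entities": ["water level sensor", "controller", "inlet valve"],   # entity recognition
    "interactions": [                                                  # interaction extraction
        {"from": "water level sensor", "to": "controller", "phenomenon": "high level reading"},
        {"from": "controller", "to": "inlet valve", "phenomenon": "close command"},
    ],
}
```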
Source: An Evaluation of Requirements Modeling for Cyber-Physical Systems via LLMs
MAO: Automated Process Model Generation Framework
MAO is a framework for automatically generating process models using multi-agent orchestration and large language models (LLMs), aiming to enhance efficiency in process modeling.
- The framework employs LLMs as the foundation for multi-agent collaboration, utilizing an innovative prompt strategy for effective coordination.
- MAO's process consists of four phases: generation of an initial rough model from text descriptions, refinement through multi-round agent dialogues, reviewing to address semantic hallucinations, and testing to detect and correct format errors.
- Experimental results show MAO outperforms existing methods and surpasses manual modeling by 89%, 61%, 52%, and 75% on four different datasets, respectively.
- This approach offers a more efficient and cost-effective alternative to traditional process modeling methods, which often require extensive expert participation.
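A compressed sketch of the four phases as a single loop; `ask_llm`, the agent prompts, and `validate_format` are stand-ins for MAO's actual prompt strategy and model-format checks.

```python
# Hypothetical outline of MAO's four phases; ask_llm() and validate_format() are stand-ins.

def generate_process_model(description: str, ask_llm, validate_format, rounds: int = 3) -> str:
    # Phase 1: generation - draft a rough model from the textual description.
    model = ask_llm(f"Draft a process model for:\n{description}")

    # Phase 2: refinement - multi-round dialogue between agents.
    for _ in range(rounds):
        critique = ask_llm(f"Critique this process model:\n{model}")
        model = ask_llm(f"Revise the model.\nModel:\n{model}\nCritique:\n{critique}")

    # Phase 3: reviewing - check the model against the text to remove semantic hallucinations.
    model = ask_llm(f"Remove any step not grounded in:\n{description}\nModel:\n{model}")

    # Phase 4: testing - detect and repair format errors before returning the model.
    errors = validate_format(model)
    if errors:
        model = ask_llm(f"Fix these format errors:\n{errors}\nModel:\n{model}")
    return model
```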
Source: MAO: A Framework for Process Model Generation with Multi-Agent Orchestration
Analysis of OpenAI Developer Forum: Trends and Developer Challenges
A comprehensive study of the OpenAI Developer Forum, examining user engagement patterns and developer concerns in working with large language models (LLMs).
- Quantitative analysis covered 29,576 forum topics, investigating temporal trends, topic popularity across categories, and user contributions at various trust levels.
- Qualitative analysis examined 9,301 recently active topics, with a sample of 886 topics used to construct a taxonomy of developer concerns.
- The study uncovered critical concerns raised by developers in creating AI-powered applications and offered targeted recommendations to address them.
- Findings aim to advance AI-assisted software engineering and empower developer communities to shape the responsible evolution and integration of AI technology.
Source: Voices from the Frontier: A Comprehensive Analysis of the OpenAI Developer Forum
FAIL: Industry-Specific Software Failure Analysis Using LLMs
A research project utilizing the Failure Analysis Investigation with LLMs (FAIL) model to extract and categorize industry-specific information about software failures from news articles.
- The study extends previous work by categorizing articles into specific domains and types of software failures, with results visually represented through graphs.
- Analysis reveals that certain software failures occur more frequently in specific industries, providing valuable insights for software engineers and companies.
- The research demonstrates the synergy between software engineering and LLMs in automating and enhancing the analysis of software failures.
- By transforming data from the FAIL database into an industry-specific model, the project aims to identify common vulnerabilities, predict potential risks, and implement proactive measures for preventing software failures.
Source: Exploring the extent of similarities in software failures across industries using LLMs
LogFixer: Automated Detection and Correction of Logging Statement Defects
A framework for automatic detection and updating of logging statements in software, addressing four types of defects identified through analysis of real-world log changes.
- LogFixer operates in two stages: offline classification of synthetic defective logs and online evaluation of code snippets for necessary improvements.
- The framework uses a similarity-based classifier for defect detection and an LLM-based recommendation system for suggesting updates based on historical log changes.
- Evaluation on real-world and synthetic datasets showed an F1 score of 0.625, with significant improvements in suggestions for static text (48.12%) and dynamic variables (24.90%).
- Testing on new real-world projects resulted in a 61.49% success rate for recommending correct updates.
- Practical application led to 25 confirmed and merged changes across 11 projects after 40 problematic logs were reported on GitHub.
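A simplified sketch of that two-stage flow: a similarity check against known defective patterns decides whether a logging statement needs attention, and an LLM proposes the update. The helpers, embeddings, and prompt below are hypothetical stand-ins.

```python
# Simplified LogFixer-style pipeline; embed(), known_defects, and ask_llm() are hypothetical.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def maybe_fix_logging(statement, context, embed, known_defects, ask_llm, threshold=0.8):
    # Stage 1: similarity-based defect detection against known defective logging patterns.
    vec = embed(statement)
    if max(cosine(vec, embed(d)) for d in known_defects) < threshold:
        return None  # statement looks fine; no update suggested

    # Stage 2: LLM-based recommendation informed by historical log changes.
    return ask_llm(
        f"Code context:\n{context}\n"
        f"Potentially defective logging statement:\n{statement}\n"
        "Suggest an improved logging statement (fix its static text and logged variables)."
    )
```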
Source: Automated Defects Detection and Fix in Logging Statement