[AI Dev Tools] OpenDevin paper, Coding Agents, Automated Math Visualization ...
OpenDevin: AI Software Developer Platform
OpenDevin is a platform for developing AI agents that interact with the world the way human developers do: by writing code, using the command line, and browsing the web.
- The platform enables implementation of new agents, safe interaction with sandboxed environments for code execution, and coordination between multiple agents.
- Evaluation benchmarks can be incorporated, with current tests covering 15 challenging tasks in software engineering and web browsing.
- Released under the MIT license, OpenDevin is a community project with contributions from over 160 developers across academia and industry.
- The platform aims to foster the development of powerful and flexible AI agents that can leverage software tools similarly to human programmers.
Source: OpenDevin: An Open Platform for AI Software Developers as Generalist Agents
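To make the sandboxed-execution idea concrete, here is a minimal sketch of running an agent-proposed shell command under a timeout in a throwaway working directory. This is not OpenDevin's API; the `run_sandboxed` helper is an illustrative assumption, and a real deployment would isolate execution in a container rather than a temp directory.

```python
import subprocess
import tempfile

def run_sandboxed(command: str, timeout: int = 10) -> str:
    """Run an agent-proposed shell command with a time limit in a scratch
    working directory. Illustrative only: a production sandbox would isolate
    the process in a container, not just a temporary directory."""
    with tempfile.TemporaryDirectory() as workdir:
        result = subprocess.run(
            command,
            shell=True,
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    # Return both streams so the agent can observe errors as well as output.
    return result.stdout + result.stderr

if __name__ == "__main__":
    print(run_sandboxed("echo 'hello from the sandbox' && python3 --version"))
```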
LLMDocParser: PDF Parsing and Analysis with LLMs
LLMDocParser is a Python package that parses PDFs and analyzes their content using LLMs, improving upon the concept of gptpdf.
Key Features:
- Uses layout analysis to identify and categorize different regions of PDF pages, including text, titles, figures, tables, headers, footers, references, and equations.
- Employs multimodal models like GPT-4V or Qwen-VL to generate text blocks from parsed PDF regions, creating RAG-friendly output.
- Supports various LLM providers, including Azure, OpenAI, and DashScope, with options for customizing API endpoints and deployment settings.
- Offers concurrent processing capabilities to improve efficiency when parsing multiple pages or documents.
- Provides detailed output including file paths, content types, page numbers, and extracted text for each parsed region.
- For example, users can analyze academic papers, extracting structured information from different sections and visualizations.
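As a rough illustration of the RAG-friendly output described above, the sketch below models one parsed region as a record with a file path, content type, page number, and extracted text. The record fields and helper function are assumptions for illustration, not LLMDocParser's actual API.

```python
from dataclasses import dataclass

# Illustrative record for one parsed region; LLMDocParser's real output
# format may differ.
@dataclass
class ParsedRegion:
    file_path: str      # image crop of the region on disk
    content_type: str   # e.g. "text", "title", "figure", "table", "equation"
    page_number: int
    text: str           # text produced by a multimodal model for this region

def describe_regions(regions: list[ParsedRegion]) -> None:
    """Print a RAG-friendly view of parsed regions in page order."""
    for region in sorted(regions, key=lambda r: r.page_number):
        print(f"[p.{region.page_number}] {region.content_type}: {region.text[:60]}")

if __name__ == "__main__":
    sample = [
        ParsedRegion("out/page1_title.png", "title", 1, "A Survey of PDF Parsing"),
        ParsedRegion("out/page2_table.png", "table", 2, "Model | F1 | Latency ..."),
    ]
    describe_regions(sample)
```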
PyBench: A Comprehensive Benchmark for LLM Agents in Real-World Coding Tasks
PyBench is a benchmark designed to evaluate LLM agents on various real-world coding tasks, including data analysis and image editing. It aims to bridge the gap between overly simplistic and complex coding benchmarks.
- The benchmark encompasses five main categories of real-world tasks, covering more than 10 types of files.
- Tasks require LLM agents to reason and execute Python code via a code interpreter for multiple turns before providing a formal response to user queries.
- Successful completion of PyBench tasks demands a robust understanding of various Python packages, superior reasoning capabilities, and the ability to incorporate feedback from executed code.
- Evaluations show that current open-source LLMs struggle with these tasks, highlighting the need for comprehensive abilities in coding tasks.
- PyLlama3, a fine-tuned 8B model, surpasses many 33B and 70B models on PyBench.
Source: PyBench: Evaluating LLM Agent on various real-world coding tasks
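The multi-turn interaction PyBench requires can be sketched as a simple loop: the agent emits Python code, the interpreter executes it, and the captured output or traceback is fed back until the agent gives a final answer. The `agent_step` callable is a stand-in for an LLM call; this is a simplification of the setup, not PyBench's harness.

```python
import contextlib
import io
import traceback

def run_code(code: str, namespace: dict) -> str:
    """Execute agent-written code and capture stdout or the traceback."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception:
        buffer.write(traceback.format_exc())
    return buffer.getvalue()

def solve(task: str, agent_step, max_turns: int = 5) -> str:
    """Multi-turn loop: the agent alternates between writing code and reading
    execution feedback before giving a formal answer. `agent_step` returns
    either {"code": ...} or {"answer": ...}."""
    namespace, feedback = {}, ""
    for _ in range(max_turns):
        action = agent_step(task, feedback)
        if "answer" in action:
            return action["answer"]
        feedback = run_code(action["code"], namespace)
    return "No answer within the turn limit."
```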
Patched RTC: LLM Evaluation for Software Development Tasks
Patched Round-Trip Correctness (Patched RTC) is a novel evaluation technique for LLMs applied to diverse software development tasks, particularly "outer loop" activities like bug fixing, code review, and documentation updates.
- An extension of the original Round-Trip Correctness method, compatible with any LLM and downstream task.
- Provides a self-evaluating framework measuring consistency and robustness of model responses without human intervention.
- Implemented in an open-source framework called patchwork, allowing transparent evaluation during inference across various patchflows.
- Experiments comparing GPT-3.5 and GPT-4 models show Patched RTC effectively distinguishes model performance and task difficulty.
- The study explores the impact of consistency prompts on improving model accuracy, suggesting Patched RTC can guide prompt refinement and model selection for complex software development workflows.
- https://github.com/patchedcodes/patchwork (inactive at the time of publishing)
Source: Patched RTC: evaluating LLMs for diverse software development tasks
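The round-trip idea behind Patched RTC can be sketched as follows: generate an artifact (e.g. a bug fix) from a task description, have the model describe that artifact back, and count how often the reconstruction agrees with the original description. This is a hedged sketch of the concept, not the patchwork implementation; `llm` and `similar` are placeholders.

```python
def round_trip_score(task_description: str, llm, similar, n_samples: int = 5) -> float:
    """Sketch of round-trip correctness: generate an artifact from a
    description, describe it back, and check that the two descriptions agree.
    `llm(prompt)` and `similar(a, b) -> bool` are placeholders for a model
    call and a semantic-similarity judge."""
    consistent = 0
    for _ in range(n_samples):
        # Forward pass: produce e.g. a bug fix or doc update from the task.
        artifact = llm(f"Perform this task:\n{task_description}")
        # Backward pass: reconstruct the task description from the artifact.
        reconstruction = llm(f"Describe what this change does:\n{artifact}")
        if similar(task_description, reconstruction):
            consistent += 1
    return consistent / n_samples
```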
AgileGen: Agile-Based Generative Software Development with Human-AI Collaboration
AgileGen is a system that enhances software development through human-AI teamwork, using Agile methodologies and Gherkin for testable requirements.
- The system addresses the challenge of incomplete user requirements in software development, which often hinders full application functionality.
- AgileGen employs Gherkin for testable requirements, ensuring semantic consistency between user needs and generated code.
- Human-AI collaboration is a key feature, allowing users to participate in decision-making processes where they excel.
- A memory pool mechanism collects and recommends user decision-making scenarios, improving the reliability of future user interactions.
- Performance tests show AgileGen outperformed existing methods by 16.4% and received higher user satisfaction ratings.
Source: Empowering Agile-Based Generative Software Development through Human-AI Teamwork
LLM-based Autonomic Computing for Microservice Management
A framework that explores using LLMs to realize the Autonomic Computing Vision (ACV) in microservice management, with the goal of self-managing computing systems.
- The study introduces a five-level taxonomy for autonomous service maintenance, addressing the challenges of realizing ACV in complex, dynamic computing environments.
- An online evaluation benchmark based on the Sock Shop microservice demo project assesses the framework's performance.
- Results show significant progress towards Level 3 autonomy, demonstrating LLMs' effectiveness in detecting and resolving issues within microservice architectures.
- The research contributes to advancing autonomic computing by integrating LLMs into microservice management frameworks, potentially leading to more adaptive and self-managing systems.
- Code for the framework will be available at https://aka.ms/ACV-LLM.
Source: The Vision of Autonomic Computing: Can LLMs Make It a Reality?
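A minimal sketch of the detect-and-resolve loop implied above: probe a microservice, ask an LLM for a remediation command, and only run it if it matches a whitelist. The `probe` and `llm` callables and the whitelist are illustrative assumptions, not the paper's framework.

```python
import subprocess

ALLOWED_PREFIXES = ("kubectl get", "kubectl describe", "kubectl rollout restart")

def remediate(service: str, probe, llm) -> str:
    """Detect-and-resolve sketch for a microservice. `probe(service)` returns
    a status string and `llm(prompt)` returns a suggested command; both are
    placeholders."""
    status = probe(service)                      # e.g. "CrashLoopBackOff on pod cart-7f..."
    if "healthy" in status:
        return "no action needed"
    command = llm(
        f"Service '{service}' reports: {status}. "
        "Suggest a single kubectl command to mitigate the issue."
    ).strip()
    if not command.startswith(ALLOWED_PREFIXES):
        return f"refused to run unlisted command: {command}"
    result = subprocess.run(command.split(), capture_output=True, text=True)
    return result.stdout or result.stderr
```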
MathViz-E: Automated Math Visualization and Solving System
MathViz-E is an automated system for mathematical pedagogy that visualizes and solves math problems from natural language commands.
- The system orchestrates mathematical solvers and graphing tools to produce accurate visualizations.
- Specialized datasets were created to address the lack of existing training and evaluation data in this domain.
- An auto-evaluator was developed to assess system outputs by comparing them to ground-truth expressions.
- The project explores challenges in domain-specific tool-using agents, including control of specialized tools and automated system evaluation.
- Datasets and code for the system have been open-sourced, promoting further research and development in this area.
Source: MathViz-E: A Case-study in Domain-Specialized Tool-Using Agents
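The auto-evaluation step above, comparing a system-produced expression against a ground-truth expression, can be illustrated with SymPy: parse both expressions and check whether their difference simplifies to zero. This is a generic sketch of that kind of check, not MathViz-E's actual evaluator.

```python
import sympy as sp

def expressions_match(predicted: str, ground_truth: str) -> bool:
    """Check symbolic equivalence of two expressions, as an auto-evaluator
    for a math agent might. Generic sketch, not MathViz-E's code."""
    pred = sp.sympify(predicted)
    truth = sp.sympify(ground_truth)
    return sp.simplify(pred - truth) == 0

if __name__ == "__main__":
    # Different surface forms of the same function should be judged equivalent.
    print(expressions_match("sin(x)**2 + cos(x)**2", "1"))   # True
    print(expressions_match("2*x + 3", "x + 3"))              # False
```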
Multimodal LLMs for Non-Crash Functional Bug Detection in Android Apps
A study exploring the use of large language models (LLMs) as test oracles for detecting non-crash functional (NCF) bugs in Android apps, addressing limitations of traditional GUI testing techniques.
- The research investigates LLMs' effectiveness in NCF bug detection, leveraging their extensive training on mobile app usage and bug report descriptions.
- An empirical study of 71 well-documented NCF bugs showed that LLMs achieved a 49% bug detection rate, surpassing existing tools for Android NCF bug detection.
- Using LLMs as test oracles, researchers discovered 24 previously unknown NCF bugs in 64 Android apps, with four bugs confirmed or fixed.
- Limitations of LLMs in this context include performance degradation, inherent randomness, and false positives.
- The study highlights the potential of LLMs in Android NCF bug detection and suggests areas for future research in this field.
- https://github.com/jub7007/OLLM (inactive at the time of publishing)
Source: A Study of Using Multimodal LLMs for Non-Crash Functional Bug Detection in Android Apps
AppWorld: Benchmark for Interactive Coding Agents in Digital Task Automation
AppWorld is a comprehensive benchmark and execution environment for evaluating autonomous agents' ability to perform complex digital tasks across multiple applications.
- AppWorld Engine: A high-quality execution environment comprising 9 day-to-day apps with 457 APIs, simulating digital activities of about 100 fictitious users.
- AppWorld Benchmark: A suite of 750 diverse and challenging tasks requiring rich, interactive code generation for autonomous agents.
- The benchmark supports robust programmatic evaluation using state-based unit tests, allowing for various task completion methods while checking for unexpected changes.
- Even GPT-4, a state-of-the-art LLM, solves only ~49% of 'normal' tasks and ~30% of 'challenge' tasks, highlighting the benchmark's difficulty.
- AppWorld aims to advance the development of interactive coding agents capable of handling complex, real-world digital tasks.
Source: AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
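The state-based evaluation idea can be sketched as a diff over application state: snapshot the relevant state before and after the agent runs, assert the expected changes happened, and assert nothing else changed. The snapshot format below is an assumption for illustration, not AppWorld's test harness.

```python
def check_task(before: dict, after: dict, expected_changes: dict) -> bool:
    """State-based check: every expected key must hold its expected new value,
    and no other key may have changed (guarding against side effects).
    Illustrative sketch, not AppWorld's actual unit-test API."""
    for key, value in expected_changes.items():
        if after.get(key) != value:
            return False
    unexpected = {
        k for k in set(before) | set(after)
        if k not in expected_changes and before.get(k) != after.get(k)
    }
    return not unexpected

if __name__ == "__main__":
    before = {"playlist:roadtrip": [], "cart": ["usb-cable"]}
    after = {"playlist:roadtrip": ["song-42"], "cart": ["usb-cable"]}
    # Task: "add song-42 to my roadtrip playlist" -- passes however the agent
    # achieved it, as long as the cart was left untouched.
    print(check_task(before, after, {"playlist:roadtrip": ["song-42"]}))  # True
```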
SAST Tools vs LLMs: Vulnerability Detection in Java, C, and Python Repositories
A comparative study of Static Application Security Testing (SAST) tools and large language models (LLMs) for detecting software vulnerabilities in Java, C, and Python repositories.
- The study evaluated 15 SAST tools and 12 open-source LLMs across three popular programming languages.
- SAST tools demonstrated low vulnerability detection rates with relatively few false positives.
- LLMs achieved 90% to 100% vulnerability detection rates but suffered from high false positive rates.
- Combining SAST tools and LLMs showed potential in mitigating the drawbacks of both approaches.
- The analysis provides insights into current progress and future directions for software vulnerability detection.
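One way the two approaches could be combined, in the spirit of the last point above, is to keep every SAST finding (few false positives) and accept an LLM-only finding only after a confirmation pass, trading away some of the LLM's false positives. The confirmation step and `llm_confirm` callable are assumptions for illustration, not a method from the study.

```python
def combine_findings(sast_findings: set, llm_findings: set, llm_confirm) -> set:
    """Combine detectors: SAST results are kept as-is, while LLM-only results
    must survive a confirmation check before being reported.
    `llm_confirm(finding) -> bool` is a placeholder, e.g. a second model
    reviewing the flagged code with more context."""
    confirmed_llm = {f for f in llm_findings - sast_findings if llm_confirm(f)}
    return sast_findings | confirmed_llm

if __name__ == "__main__":
    sast = {"CWE-89 in query.py:42"}
    llm = {"CWE-89 in query.py:42", "CWE-79 in views.py:10", "CWE-22 in io.py:7"}
    print(combine_findings(sast, llm, llm_confirm=lambda f: "CWE-79" in f))
```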
Automated User Feedback Processing for Software Engineering
A comprehensive overview of techniques for processing and analyzing user feedback in software engineering, addressing challenges of quantity and quality.
- User feedback from social media, product forums, and app stores provides valuable insights for requirements engineering, UI design, and software development.
- Benefits include better understanding of feature usage, faster defect identification and resolution, and inspiration for improvements.
- Two main challenges: managing large quantities of feedback data and dealing with varying quality of feedback items.
- The chapter outlines data mining, machine learning, and natural language processing techniques, including LLMs, to address these challenges.
- Guidance is provided for researchers and practitioners on implementing effective analysis of user feedback for software and requirements engineering.
Source: On the Automated Processing of User Feedback
Evidence-Based Practices in LLM Programming Assistants: An Evaluation
A study evaluating the adoption of evidence-based software engineering practices by LLM-based programming assistants.
- The research investigated 17 evidence-based claims from empirical software engineering across five LLM programming assistants.
- Findings revealed ambiguous beliefs regarding research claims and a lack of credible evidence to support responses from these assistants.
- The LLM-based programming assistants were largely unable to apply practices supported by empirical software engineering research to development tasks.
- The study provides implications for practitioners using these assistants in development contexts and suggests future research directions to enhance their reliability and trustworthiness.
- The goal is to increase awareness and adoption of evidence-based software engineering research findings in practice.
Source: Exploring the Evidence-Based Beliefs and Behaviors of LLM-Based Programming Assistants
LLMs for Test Smell Detection: A Multi-Language Evaluation
An evaluation of how well ChatGPT-4, Mistral Large, and Gemini Advanced detect test smells across seven programming languages.
- Test smells, coding issues that can impact software maintainability and reliability, were the focus of this study.
- The evaluation covered 30 types of test smells in codebases from seven different programming languages.
- ChatGPT-4 performed best, identifying 21 types of test smells, followed by Gemini Advanced with 17 and Mistral Large with 15.
- Results suggest LLMs have potential as valuable tools for identifying test smells, offering an alternative to traditional static analysis or machine learning techniques.
Source: Evaluating Large Language Models in Detecting Test Smells
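For readers unfamiliar with the term, two well-known test smells are shown below in Python. These are standard textbook examples of the kinds of issues such detectors look for, not cases taken from the study.

```python
import time
import unittest

class OrderTests(unittest.TestCase):
    def test_order_total(self):
        # Assertion Roulette: several assertions with no failure messages,
        # so a failing run does not say which expectation broke.
        order = {"items": 3, "total": 42.0, "currency": "EUR"}
        self.assertEqual(order["items"], 3)
        self.assertEqual(order["total"], 42.0)
        self.assertEqual(order["currency"], "EUR")

    def test_async_update(self):
        # Sleepy Test: waiting a fixed time instead of synchronizing on the
        # actual condition makes the test slow and flaky.
        time.sleep(2)
        self.assertTrue(True)

if __name__ == "__main__":
    unittest.main()
```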
CAPE: Chat-like Asserts Prediction for Python Using LLMs
CAPE is an approach that generates meaningful assert statements for Python projects using LLMs and interpreter interaction.
- The system employs persona, Chain-of-Thought, and one-shot learning techniques in prompt design to enhance assert statement generation.
- CAPE conducts multiple rounds of communication between the LLM and Python interpreter to produce effective assert statements.
- Evaluation shows 64.7% accuracy for single assert statement generation and 62% for overall assert statement generation, surpassing existing methods.
- A Python assert statement dataset from GitHub was created to support the research.
- The approach has potential applications in automated Python unit test generation and broader software engineering practices.
Source: Chat-like Asserts Prediction with the Support of Large Language Model
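The LLM-interpreter interaction described above can be sketched as a feedback loop: ask the model for an assert statement for a focal function, execute it, and on failure feed the error back for another round. The `llm` callable is a placeholder and the loop is a simplification of the idea behind CAPE, not its implementation.

```python
import traceback

def generate_assert(focal_code: str, llm, max_rounds: int = 3) -> str | None:
    """Chat-like loop: request an assert statement, run it against the focal
    code, and return it once it executes cleanly. `llm(prompt)` is a
    placeholder for a model call."""
    feedback = ""
    for _ in range(max_rounds):
        statement = llm(
            "You are a Python testing expert. Write one assert statement "
            f"for this code:\n{focal_code}\n{feedback}"
        ).strip()
        try:
            namespace: dict = {}
            exec(focal_code, namespace)   # define the function under test
            exec(statement, namespace)    # try the proposed assert
            return statement
        except Exception:
            feedback = f"Previous attempt failed:\n{traceback.format_exc()}"
    return None
```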
HardEval: A Framework for Assessing LLM Task Difficulty in Programming
HardEval is a framework for evaluating the difficulty of programming tasks for LLMs and creating new challenging tasks based on identified hard problems.
- Current LLM evaluations often use general metrics over benchmarks, which may not accurately reflect the difficulty of individual tasks or the true capabilities of the models.
- The framework uses diverse prompts across multiple LLMs to generate a difficulty score for each task in a benchmark. This approach helps identify truly challenging problems for LLMs.
- Analysis of two code generation benchmarks, HumanEval+ and ClassEval, revealed that only 21% and 27% of their tasks, respectively, are difficult for LLMs.
- HardEval identified six practical hard task topics, which were used to generate new challenging tasks for LLM evaluation and improvement.
- The framework's general approach can be applied to domains beyond code generation, such as code completion or question answering.
Source: Assessing Programming Task Difficulty for Efficient Evaluation of Large Language Models
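The scoring idea can be sketched as follows: solve each task with several prompt variants across several LLMs and take the difficulty as one minus the mean pass rate, so tasks that almost no (prompt, model) pair solves score near 1. This is a simplified reading of the approach, not HardEval's exact formula; `passes` is a placeholder for generating a solution and running the task's tests.

```python
def difficulty_score(task, prompts, models, passes) -> float:
    """Difficulty of a task as 1 - mean pass rate over all (prompt, model)
    pairs. `passes(task, prompt, model) -> bool` is a placeholder; this is a
    simplified sketch of HardEval-style scoring."""
    attempts = [(p, m) for p in prompts for m in models]
    solved = sum(passes(task, p, m) for p, m in attempts)
    return 1.0 - solved / len(attempts)

if __name__ == "__main__":
    # Toy example: only one of six prompt/model combinations solves the task,
    # so it is scored as hard (difficulty ~0.83).
    print(difficulty_score(
        task="reverse-linked-list",
        prompts=["direct", "cot", "few-shot"],
        models=["model-a", "model-b"],
        passes=lambda t, p, m: (p, m) == ("cot", "model-a"),
    ))
```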