[AI Dev Tools] OpenDevin paper, Coding Agents, Automated Math Visualization ...


OpenDevin: AI Software Developer Platform

OpenDevin is a platform for developing AI agents that interact with the world the way human developers do: by writing code, working in the command line, and browsing the web.

  • The platform enables implementation of new agents, safe code execution in sandboxed environments, and coordination between multiple agents (a minimal agent-sandbox loop is sketched after this list).
  • Evaluation benchmarks can be incorporated, with current tests covering 15 challenging tasks in software engineering and web browsing.
  • Released under the MIT license, OpenDevin is a community project with contributions from over 160 developers across academia and industry.
  • The platform aims to foster the development of powerful and flexible AI agents that can leverage software tools similarly to human programmers.
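
To make that loop concrete, here is a minimal sketch of the kind of agent-sandbox cycle the platform coordinates. The Sandbox and agent_step names are illustrative stand-ins under assumed conventions, not OpenDevin's actual API:

```python
import subprocess

class Sandbox:
    """Isolated executor for agent-issued shell commands (illustrative only)."""
    def run(self, command: str) -> str:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=60
        )
        return result.stdout + result.stderr

def agent_step(history: list[str], llm) -> str:
    """Ask the LLM for the next shell command given the transcript so far."""
    prompt = "You are a software agent.\n" + "\n".join(history) + "\nNext command:"
    return llm(prompt)

def run_episode(task: str, llm, max_turns: int = 10) -> list[str]:
    sandbox, history = Sandbox(), [f"Task: {task}"]
    for _ in range(max_turns):
        command = agent_step(history, llm)
        if command.strip() == "DONE":  # stop convention assumed for this sketch
            break
        history.append(f"$ {command}\n{sandbox.run(command)}")
    return history
```
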
Tools you can use from the paper:
https://github.com/OpenDevin/OpenDevin

Source: OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

LLMDocParser: PDF Parsing and Analysis with LLMs

LLMDocParser is a Python package that parses PDFs and analyzes their content using LLMs, improving upon the concept of gptpdf.

Key Features:
  • Uses layout analysis to identify and categorize different regions of PDF pages, including text, titles, figures, tables, headers, footers, references, and equations.
  • Employs multimodal models like GPT-4V or Qwen-VL to generate text blocks from parsed PDF regions, creating RAG-friendly output.
  • Supports various LLM providers, including Azure, OpenAI, and DashScope, with options for customizing API endpoints and deployment settings.
  • Offers concurrent processing capabilities to improve efficiency when parsing multiple pages or documents.
  • Provides detailed output including file paths, content types, page numbers, and extracted text for each parsed region.
  • For example, users can analyze academic papers, extracting structured information from different sections and visualizations.
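
A hypothetical usage sketch tying these features together; the entry point, argument names, and return shape here are assumptions rather than the package's documented API, so check the repository README before use:

```python
from llmdocparser import parse_pdf  # assumed entry point, see the repo README

# Parse each page region (text, table, figure, ...) into an LLM-generated
# text block; provider, model, and concurrency mirror the features above.
blocks = parse_pdf(
    "paper.pdf",
    provider="openai",   # or "azure", "dashscope"
    model="gpt-4v",      # multimodal model that turns regions into text
    concurrency=4,       # process several pages in parallel
)

for block in blocks:
    # Each block is assumed to expose region type, page number, and content.
    print(block["type"], block["page"], block["content"][:80])
```
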
Source: https://github.com/lazyFrogLOL/llmdocparser

PyBench: A Comprehensive Benchmark for LLM Agents in Real-World Coding Tasks

PyBench is a benchmark designed to evaluate LLM agents on various real-world coding tasks, including data analysis and image editing. It aims to bridge the gap between overly simplistic and overly complex coding benchmarks.

  • The benchmark encompasses five main categories of real-world tasks, covering more than 10 types of files.
  • Tasks require LLM agents to reason and execute Python code via a code interpreter over multiple turns before providing a formal response to the user's query (see the sketch after this list).
  • Successful completion of PyBench tasks demands a robust understanding of various Python packages, superior reasoning capabilities, and the ability to incorporate feedback from executed code.
  • Evaluations show that current open-source LLMs struggle with these tasks, highlighting the need for comprehensive abilities in coding tasks.
  • PyLlama3, a fine-tuned 8B-parameter model, outperforms many 33B and 70B models on PyBench.
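
As a rough illustration of that multi-turn loop, here is a hedged minimal sketch; the benchmark's real harness sandboxes execution, and the CODE: output convention is an assumption made for brevity:

```python
import contextlib
import io

def run_python(code: str, namespace: dict) -> str:
    """Execute agent code and capture stdout; real harnesses sandbox this step."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, namespace)
    except Exception as exc:
        return f"Error: {exc}"
    return buf.getvalue()

def solve(query: str, llm, max_turns: int = 5) -> str:
    namespace, transcript = {}, [f"User: {query}"]
    for _ in range(max_turns):
        reply = llm("\n".join(transcript))  # model emits code or a final answer
        if not reply.startswith("CODE:"):   # output convention assumed here
            return reply                    # formal response to the user
        transcript.append(reply)
        observation = run_python(reply.removeprefix("CODE:"), namespace)
        transcript.append(f"Observation: {observation}")
    return "No answer within the turn budget."
```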

Source: PyBench: Evaluating LLM Agent on various real-world coding tasks

Patched RTC: LLM Evaluation for Software Development Tasks

Patched Round-Trip Correctness (Patched RTC) is a novel evaluation technique for LLMs applied to diverse software development tasks, particularly "outer loop" activities like bug fixing, code review, and documentation updates.

  • An extension of the original Round-Trip Correctness method, compatible with any LLM and downstream task.
  • Provides a self-evaluating framework that measures the consistency and robustness of model responses without human intervention (the round-trip idea is sketched after this list).
  • Implemented in an open-source framework called patchwork, allowing transparent evaluation during inference across various patchflows.
  • Experiments comparing GPT-3.5 and GPT-4 models show Patched RTC effectively distinguishes model performance and task difficulty.
  • The study explores the impact of consistency prompts on improving model accuracy, suggesting Patched RTC can guide prompt refinement and model selection for complex software development workflows.
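
A hedged sketch of the round-trip idea, using patch summarization as the example task and an LLM judge for scoring; the paper's actual prompts and scoring setup may differ:

```python
def round_trip_score(patch: str, llm, judge, n: int = 3) -> float:
    """Score consistency by inverting the task and comparing with the input."""
    scores = []
    for _ in range(n):
        summary = llm(f"Summarize this patch:\n{patch}")  # forward pass
        rebuilt = llm(f"Write a patch implementing this summary:\n{summary}")
        # judge(a, b) -> 1.0 if the two patches are semantically equivalent,
        # else 0.0 (an LLM-as-judge call; other similarity measures also work).
        scores.append(judge(patch, rebuilt))
    return sum(scores) / n
```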

Source: Patched RTC: evaluating LLMs for diverse software development tasks

AgileGen: Agile-Based Generative Software Development with Human-AI Collaboration

AgileGen is a system that enhances software development through human-AI teamwork, using Agile methodologies and Gherkin for testable requirements.

  • The system addresses the challenge of incomplete user requirements in software development, which often hinders full application functionality.
  • AgileGen employs Gherkin for testable requirements, ensuring semantic consistency between user needs and generated code (an example scenario appears after this list).
  • Human-AI collaboration is a key feature, allowing users to participate in decision-making processes where they excel.
  • A memory pool mechanism collects and recommends user decision-making scenarios, improving the reliability of future user interactions.
  • Performance tests show AgileGen outperformed existing methods by 16.4% and received higher user satisfaction ratings.
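
For readers unfamiliar with Gherkin, here is an illustrative scenario of the kind AgileGen relies on to pin a user requirement to testable behavior, with a small helper that pulls out the steps to map onto tests; both are examples, not AgileGen's own code:

```python
SCENARIO = """
Feature: User login
  Scenario: Successful login with valid credentials
    Given a registered user "alice" with password "s3cret"
    When she submits the login form with those credentials
    Then she is redirected to her dashboard
"""

def scenario_steps(text: str) -> list[str]:
    """Extract Given/When/Then steps so each can be mapped to a test case."""
    keywords = ("Given", "When", "Then", "And")
    return [line.strip() for line in text.splitlines()
            if line.strip().startswith(keywords)]

print(scenario_steps(SCENARIO))  # three steps, each checkable against the code
```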

Source: Empowering Agile-Based Generative Software Development through Human-AI Teamwork

LLM-based Autonomic Computing for Microservice Management

A framework exploring the use of LLMs to realize the Autonomic Computing Vision (ACV) of self-managing computing systems in microservice management.

  • The study introduces a five-level taxonomy for autonomous service maintenance, addressing the challenges of realizing ACV in complex, dynamic computing environments.
  • An online evaluation benchmark based on the Sock Shop microservice demo project assesses the framework's performance.
  • Results show significant progress toward Level 3 autonomy, demonstrating LLMs' effectiveness in detecting and resolving issues within microservice architectures (a minimal maintenance loop is sketched after this list).
  • The research contributes to advancing autonomic computing by integrating LLMs into microservice management frameworks, potentially leading to more adaptive and self-managing systems.
  • Code for the framework will be available at https://aka.ms/ACV-LLM.
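
A minimal sketch of the kind of LLM-in-the-loop maintenance cycle such a framework implies; the function names are illustrative stand-ins, not the paper's actual components:

```python
def maintenance_cycle(get_metrics, is_healthy, llm, execute) -> str:
    """One monitor-analyze-plan-execute pass over a microservice."""
    metrics = get_metrics()          # e.g. latency, error rate, pod status
    if is_healthy(metrics):
        return "ok"
    plan = llm(
        "Given these service metrics, diagnose the likely issue and "
        f"propose a remediation action:\n{metrics}"
    )
    return execute(plan)             # e.g. restart a pod, scale a deployment
```
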
Tools you can use from the paper:
https://aka.ms/ACV-LLM

Source: The Vision of Autonomic Computing: Can LLMs Make It a Reality?

MathViz-E: Automated Math Visualization and Solving System

MathViz-E is an automated system for mathematical pedagogy that visualizes and solves math problems from natural language commands.

  • The system orchestrates mathematical solvers and graphing tools to produce accurate visualizations.
  • Specialized datasets were created to address the lack of existing training and evaluation data in this domain.
  • An auto-evaluator was developed to assess system outputs by comparing them to ground-truth expressions (one plausible equivalence check is sketched after this list).
  • The project explores challenges in domain-specific tool-using agents, including control of specialized tools and automated system evaluation.
  • Datasets and code for the system have been open-sourced, promoting further research and development in this area.
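
One plausible core for such an auto-evaluator is a symbolic-equivalence check, sketched here with sympy; the paper's evaluator may normalize expressions differently:

```python
import sympy

def expressions_match(predicted: str, ground_truth: str) -> bool:
    """True when the two expressions are symbolically equivalent."""
    diff = sympy.simplify(sympy.sympify(predicted) - sympy.sympify(ground_truth))
    return diff == 0

print(expressions_match("2*x + 2", "2*(x + 1)"))  # True: same expression
```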

Source: MathViz-E: A Case-study in Domain-Specialized Tool-Using Agents

Multimodal LLMs for Non-Crash Functional Bug Detection in Android Apps

A study exploring the use of large language models (LLMs) as test oracles for detecting non-crash functional (NCF) bugs in Android apps, addressing limitations of traditional GUI testing techniques.

  • The research investigates LLMs' effectiveness in NCF bug detection, leveraging their extensive training on mobile app usage and bug report descriptions (the oracle idea is sketched after this list).
  • An empirical study of 71 well-documented NCF bugs showed that LLMs achieved a 49% bug detection rate, surpassing existing tools for Android NCF bug detection.
  • Using LLMs as test oracles, researchers discovered 24 previously unknown NCF bugs in 64 Android apps, with four bugs confirmed or fixed.
  • Limitations of LLMs in this context include performance degradation, inherent randomness, and false positives.
  • The study highlights the potential of LLMs in Android NCF bug detection and suggests areas for future research in this field.
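
The oracle idea reduces to asking a multimodal model whether a screen looks wrong given the action that produced it. A hedged sketch, where vision_llm stands in for any multimodal chat API:

```python
def looks_buggy(screenshot_path: str, action: str, vision_llm) -> bool:
    """Use a multimodal LLM as a test oracle for one GUI state."""
    verdict = vision_llm(
        image=screenshot_path,
        prompt=(
            f"The user just performed: {action}. "
            "Does this screen show a non-crash functional bug (wrong content, "
            "missing result, broken layout)? Answer YES or NO."
        ),
    )
    return verdict.strip().upper().startswith("YES")
```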

Source: A Study of Using Multimodal LLMs for Non-Crash Functional Bug Detection in Android Apps

AppWorld: Benchmark for Interactive Coding Agents in Digital Task Automation

AppWorld is a comprehensive benchmark and execution environment for evaluating autonomous agents' ability to perform complex digital tasks across multiple applications.

  • AppWorld Engine: A high-quality execution environment comprising 9 day-to-day apps with 457 APIs, simulating digital activities of about 100 fictitious users.
  • AppWorld Benchmark: A suite of 750 diverse and challenging tasks requiring rich, interactive code generation for autonomous agents.
  • The benchmark supports robust programmatic evaluation using state-based unit tests, allowing different ways of completing a task while checking for unexpected changes to the world state (sketched after this list).
  • GPT-4, the state-of-the-art LLM, solves only ~49% of 'normal' tasks and ~30% of 'challenge' tasks, highlighting the benchmark's difficulty.
  • AppWorld aims to advance the development of interactive coding agents capable of handling complex, real-world digital tasks.
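
A hedged sketch of what a state-based check looks like: assert on the world state after the agent runs, not on how the agent got there. The app and field names are invented for illustration:

```python
def check_bill_payment(world_before: dict, world_after: dict) -> bool:
    """Example check for a task like 'pay my latest phone bill'."""
    bill_paid = world_after["phone"]["bills"][-1]["status"] == "paid"
    # Also guard against unexpected side effects elsewhere in the world.
    contacts_untouched = world_before["contacts"] == world_after["contacts"]
    return bill_paid and contacts_untouched
```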

Source: AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

SAST Tools vs LLMs: Vulnerability Detection in Java, C, and Python Repositories

A comparative study of Static Application Security Testing (SAST) tools and large language models (LLMs) for detecting software vulnerabilities in Java, C, and Python repositories.

  • The study evaluated 15 SAST tools and 12 open-source LLMs across three popular programming languages.
  • SAST tools demonstrated low vulnerability detection rates with relatively few false positives.
  • LLMs achieved 90% to 100% vulnerability detection rates but suffered from high false positive rates.
  • Combining SAST tools and LLMs showed potential for mitigating the drawbacks of both approaches (one simple triage policy is sketched after this list).
  • The analysis provides insights into current progress and future directions for software vulnerability detection.
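
One simple way to combine the two detector families, assuming findings can be keyed by (file, line, CWE) triples: rank LLM findings corroborated by a SAST tool first, keeping the LLMs' recall while containing their false positives. A sketch, not the study's method:

```python
def triage(sast_hits: set[tuple], llm_hits: set[tuple]) -> list[tuple]:
    """Rank (file, line, cwe) findings: corroborated first, LLM-only after."""
    corroborated = sorted(sast_hits & llm_hits)   # both detectors agree
    llm_only = sorted(llm_hits - sast_hits)       # high recall, but noisier
    return corroborated + llm_only
```
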
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Comparison of Static Application Security Testing Tools and Large Language Models for Repo-level Vulnerability Detection

Automated User Feedback Processing for Software Engineering

A comprehensive overview of techniques for processing and analyzing user feedback in software engineering, addressing challenges of quantity and quality.

  • User feedback from social media, product forums, and app stores provides valuable insights for requirements engineering, UI design, and software development.
  • Benefits include better understanding of feature usage, faster defect identification and resolution, and inspiration for improvements.
  • Two main challenges: managing large quantities of feedback data and dealing with varying quality of feedback items.
  • The chapter outlines data mining, machine learning, and natural language processing techniques, including LLMs, that address these challenges (one such step is sketched after this list).
  • Guidance is provided for researchers and practitioners on implementing effective analysis of user feedback for software and requirements engineering.
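
As one concrete example of such a technique, a minimal LLM-based classification step for incoming reviews; the labels and prompt are assumptions, not the chapter's taxonomy:

```python
LABELS = ("bug report", "feature request", "praise", "other")

def classify_review(review: str, llm) -> str:
    """Bucket one feedback item into a category engineers can act on."""
    answer = llm(
        f"Classify this app review as one of {LABELS}. "
        f"Answer with the label only.\n\nReview: {review!r}"
    ).strip().lower()
    return answer if answer in LABELS else "other"
```
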
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: On the Automated Processing of User Feedback

Evidence-Based Practices in LLM Programming Assistants: An Evaluation

A study evaluating the adoption of evidence-based software engineering practices by LLM-based programming assistants.

  • The research investigated 17 evidence-based claims from empirical software engineering across five LLM programming assistants.
  • Findings revealed ambiguous beliefs regarding research claims and a lack of credible evidence to support responses from these assistants.
  • The assistants were unable to adopt practices demonstrated by empirical software engineering research when performing development tasks.
  • The study provides implications for practitioners using these assistants in development contexts and suggests future research directions to enhance their reliability and trustworthiness.
  • The goal is to increase awareness and adoption of evidence-based software engineering research findings in practice.
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Exploring the Evidence-Based Beliefs and Behaviors of LLM-Based Programming Assistants

LLMs for Test Smell Detection: A Multi-Language Evaluation

An evaluation of the ability of ChatGPT-4, Mistral Large, and Gemini Advanced to detect test smells across seven programming languages.

  • The study focused on test smells: coding issues in test suites that can impact software maintainability and reliability.
  • The evaluation covered 30 types of test smells in codebases from seven different programming languages.
  • ChatGPT-4 performed best, identifying 21 types of test smells, followed by Gemini Advanced with 17 and Mistral Large with 15.
  • Results suggest LLMs have potential as valuable tools for identifying test smells, offering an alternative to traditional static analysis or machine learning techniques.
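
The evaluation setup boils down to prompting a chat model with test code and a smell catalog. A hedged sketch; the smell list is a small sample and the prompt wording is an assumption:

```python
SMELLS = ("Assertion Roulette", "Sleepy Test", "Magic Number Test")

def detect_smells(test_code: str, llm) -> list[str]:
    """Ask an LLM which known test smells appear in the given test code."""
    answer = llm(
        "Which of these test smells appear in the following test code? "
        f"{', '.join(SMELLS)}. List the names, one per line.\n\n{test_code}"
    )
    return [line.strip() for line in answer.splitlines()
            if line.strip() in SMELLS]
```
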
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Evaluating Large Language Models in Detecting Test Smells

CAPE: Chat-like Asserts Prediction for Python Using LLMs

CAPE is an approach that generates meaningful assert statements for Python projects using LLMs and interpreter interaction.

  • The system employs persona, Chain-of-Thought, and one-shot learning techniques in prompt design to enhance assert statement generation.
  • CAPE conducts multiple rounds of communication between the LLM and a Python interpreter to produce effective assert statements (a minimal version of this loop is sketched after this list).
  • Evaluation shows 64.7% accuracy for single assert statement generation and 62% for overall assert statement generation, surpassing existing methods.
  • A Python assert statement dataset from GitHub was created to support the research.
  • The approach has potential applications in automated Python unit test generation and broader software engineering practices.
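
A minimal version of that generate-execute-refine loop; prompting details such as persona, chain-of-thought, and one-shot examples are elided, and exec stands in for the interpreter session:

```python
def generate_assert(focal_code: str, llm, max_rounds: int = 3) -> str | None:
    """Propose an assert, run it, and feed interpreter feedback into retries."""
    feedback = ""
    for _ in range(max_rounds):
        candidate = llm(
            f"Write one assert statement testing this code:\n{focal_code}\n{feedback}"
        )
        try:
            exec(focal_code + "\n" + candidate, {})
            return candidate                 # the assert executed and passed
        except AssertionError:
            feedback = f"The assert {candidate!r} failed; propose another."
        except Exception as exc:
            feedback = f"Interpreter error: {exc}; fix the statement."
    return None
```
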
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Chat-like Asserts Prediction with the Support of Large Language Model

HardEval: A Framework for Assessing LLM Task Difficulty in Programming

HardEval is a framework for evaluating the difficulty of programming tasks for LLMs and creating new challenging tasks based on identified hard problems.

  • Current LLM evaluations often use general metrics over benchmarks, which may not accurately reflect the difficulty of individual tasks or the true capabilities of the models.
  • The framework uses diverse prompts across multiple LLMs to generate a difficulty score for each task in a benchmark, helping identify truly challenging problems for LLMs (the scoring idea is sketched after this list).
  • Analysis of two code generation benchmarks, HumanEval+ and ClassEval, revealed that only 21% and 27% of their tasks, respectively, are difficult for LLMs.
  • HardEval identified six practical hard task topics, which were used to generate new challenging tasks for LLM evaluation and improvement.
  • The framework's general approach can be applied to domains beyond code generation, such as code completion or question-answering tasks.
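
The scoring idea can be reduced to a pass-rate complement across model-prompt pairs; a sketch under the assumption that a passes callback runs the task's tests, with HardEval's exact weighting possibly differing:

```python
def difficulty(task, models, prompts, passes) -> float:
    """passes(model, prompt, task) -> bool: did the generated code pass tests?"""
    attempts = [(m, p) for m in models for p in prompts]
    pass_rate = sum(passes(m, p, task) for m, p in attempts) / len(attempts)
    return 1.0 - pass_rate   # 1.0 means no model solved it under any prompt
```
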
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Assessing Programming Task Difficulty for Efficient Evaluation of Large Language Models