8 min read

[AI Dev Tools] AI-Assisted Debugging, Patch Generation, Code Summarization...


Cursor Lens: An open-source dashboard for Cursor.sh IDE

Cursor Lens is an open-source tool that provides insights into AI-assisted coding sessions using Cursor AI, acting as a proxy between Cursor and various AI providers.

Key Features:
  • Integrates with multiple AI providers, including OpenAI and Anthropic, and captures and logs every request that passes between Cursor and those providers (see the sketch after this list).
  • Offers a visual analytics dashboard displaying AI usage, token consumption, and request patterns, along with real-time monitoring of ongoing AI interactions.
  • Allows users to configure and switch between different AI models, tracking token usage and providing cost estimates based on model pricing.
  • Built using Next.js with React for the frontend and backend, PostgreSQL with Prisma ORM for the database, and Tailwind CSS with shadcn/ui components for styling.
  • Supports prompt caching with Anthropic, allowing system and context messages in specific chats to be cached for improved efficiency.
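
Cursor Lens itself is a Next.js/TypeScript application backed by Prisma and PostgreSQL; the sketch below only illustrates the underlying proxy-and-log pattern in Python, with a hypothetical provider endpoint and a local SQLite table standing in for the real schema.

```python
# Minimal sketch of the proxy-and-log idea behind Cursor Lens (the real tool
# is a Next.js app with Prisma/PostgreSQL). The endpoint and SQLite schema
# here are illustrative assumptions, not Cursor Lens's actual API.
import sqlite3
import time

import requests  # pip install requests

UPSTREAM = "https://api.openai.com/v1/chat/completions"  # provider endpoint
DB = sqlite3.connect("usage_log.db")
DB.execute(
    "CREATE TABLE IF NOT EXISTS requests ("
    "  ts REAL, model TEXT, prompt_tokens INTEGER,"
    "  completion_tokens INTEGER, latency_s REAL)"
)

def forward_and_log(payload: dict, api_key: str) -> dict:
    """Forward a chat-completion request to the provider and log its usage."""
    start = time.time()
    resp = requests.post(
        UPSTREAM,
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    body = resp.json()
    usage = body.get("usage", {})
    DB.execute(
        "INSERT INTO requests VALUES (?, ?, ?, ?, ?)",
        (
            start,
            payload.get("model", "unknown"),
            usage.get("prompt_tokens", 0),
            usage.get("completion_tokens", 0),
            time.time() - start,
        ),
    )
    DB.commit()
    return body
```
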
Source: https://github.com/HamedMP/CursorLens

PatUntrack: Automated Patch Example Generation for Issue Reports

PatUntrack is a system that automatically generates patch examples from vulnerability issue reports (IRs) without tracked insecure code, using LLMs to analyze vulnerabilities.

  • The system generates a complete description of the Vulnerability-Triggering Path (VTP) from vulnerable IRs.
  • PatUntrack corrects hallucinations in the VTP description using external golden knowledge.
  • It then produces Top-K pairs of Insecure Code and Patch Examples based on the corrected VTP description.
  • Experiments on 5,465 vulnerable IRs showed PatUntrack outperformed traditional LLM baselines by 14.6% (Fix@10) on average in patch example generation.
  • In a real-world application, 27 out of 37 IR authors confirmed the usefulness of PatUntrack-generated patch examples for 76 newly disclosed vulnerable IRs.
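
A minimal sketch of the three-stage pipeline summarized above. The prompts, the external-knowledge correction step, and the `llm` callable are placeholders, not PatUntrack's actual implementation.

```python
# Illustrative three-stage pipeline (not PatUntrack's actual prompts).
# `llm` is any callable that maps a prompt string to a completion string.
from typing import Callable, List, Tuple

def generate_vtp(llm: Callable[[str], str], issue_report: str) -> str:
    """Stage 1: describe the Vulnerability-Triggering Path (VTP) from the IR."""
    return llm(
        "Describe the vulnerability-triggering path for this issue report:\n"
        + issue_report
    )

def correct_vtp(llm: Callable[[str], str], vtp: str, knowledge: str) -> str:
    """Stage 2: revise the VTP against external ('golden') vulnerability knowledge."""
    return llm(
        "Correct any hallucinated steps in this vulnerability-triggering path "
        f"using the reference knowledge below.\n\nVTP:\n{vtp}\n\nKnowledge:\n{knowledge}"
    )

def generate_patch_examples(
    llm: Callable[[str], str], vtp: str, k: int = 10
) -> List[Tuple[str, str]]:
    """Stage 3: sample Top-K (insecure code, patch example) pairs."""
    pairs = []
    for _ in range(k):
        insecure = llm(f"Write insecure code that follows this path:\n{vtp}")
        patch = llm(
            f"Write a patch example that breaks this path:\n{vtp}\n\nCode:\n{insecure}"
        )
        pairs.append((insecure, patch))
    return pairs
```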

Source: PatUntrack: Automated Generating Patch Examples for Issue Reports without Tracked Insecure Code

CrashTracker: Explainable Fault Localization for Framework-Specific Crashes

A tool that combines static analysis and LLMs to locate and explain crashing faults in applications relying on complex frameworks, particularly focusing on Android framework-specific crashes.

  • The approach uses exception-thrown summaries (ETS) to describe key elements related to framework-specific exceptions, extracted through static analysis.
  • Data-tracking of ETS elements helps identify and prioritize potential buggy methods for a given crash.
  • LLMs enhance result explainability using candidate information summaries (CIS), which provide multiple types of explanation-related contexts.
  • CrashTracker achieved an MRR of 0.91 for fault localization precision and improved user satisfaction with fault explanations by 67.04% compared to static analysis alone.
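
A rough sketch of the final explanation step, assuming the static-analysis stage has already produced ranked candidate methods with candidate information summaries (CIS); the data structures and prompt below are illustrative, not CrashTracker's.

```python
# Sketch of the "explain a ranked candidate with an LLM" step. The scores and
# CIS text are stand-ins; CrashTracker derives them via static analysis of the
# Android framework and ETS data-tracking.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    method: str    # fully qualified application method
    score: float   # suspiciousness from ETS data-tracking
    context: str   # candidate information summary (CIS) text

def explain_top_candidates(
    llm: Callable[[str], str],
    crash_msg: str,
    candidates: List[Candidate],
    top_n: int = 3,
) -> List[str]:
    """Ask the LLM to explain why each top-ranked method may cause the crash."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)[:top_n]
    explanations = []
    for cand in ranked:
        prompt = (
            f"Crash: {crash_msg}\n"
            f"Suspicious method: {cand.method}\n"
            f"Context:\n{cand.context}\n"
            "Explain briefly why this method could trigger the crash and how to fix it."
        )
        explanations.append(llm(prompt))
    return explanations
```
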
Source: Better Debugging: Combining Static Analysis and LLMs for Explainable Crashing Fault Localization

UTGen: Enhancing Automated Unit Test Understandability with LLMs

UTGen combines search-based software testing and LLMs to improve the comprehensibility of automatically generated unit tests, addressing a common challenge faced by software engineers.

  • The tool enhances test understandability by contextualizing test data, improving identifier naming, and adding descriptive comments.
  • A controlled experiment with 32 participants from academia and industry evaluated UTGen's impact on bug-fixing tasks.
  • Results showed participants using UTGen test cases fixed up to 33% more bugs and required up to 20% less time compared to baseline test cases.
  • Feedback from participants indicated that enhanced test names, test data, and variable names contributed to an improved bug-fixing process.
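
A minimal sketch of the enhancement step UTGen performs on search-based tests; the prompt wording and the `llm` callable are assumptions, not the tool's actual prompts.

```python
# Sketch of the readability-enhancement step: take a raw search-based test
# (e.g., EvoSuite output) and ask an LLM to rename identifiers, contextualize
# test data, and add comments without changing the asserted behavior.
from typing import Callable

def improve_test_readability(
    llm: Callable[[str], str], junit_test: str, focal_method: str
) -> str:
    prompt = (
        "Rewrite the JUnit test below so it is easier to understand:\n"
        "- give the test and its variables descriptive names\n"
        "- replace meaningless literals with realistic test data\n"
        "- add a short comment explaining the scenario\n"
        "Keep the assertions and tested behavior unchanged.\n\n"
        f"Method under test:\n{focal_method}\n\nTest:\n{junit_test}"
    )
    return llm(prompt)
```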

Source: Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests

Kubernetes Manifest Generation: LLM-Based Approach Evaluation

A study proposing a benchmarking method to evaluate the effectiveness of LLMs in synthesizing Kubernetes manifests from Compose specifications.

  • The benchmark takes Compose specifications, a format widely adopted by application developers, as input (see the sketch after this list).
  • Results show LLMs generally produce accurate manifests and compensate for simple specification gaps.
  • Inline comments for readability were often omitted in the generated manifests.
  • LLMs demonstrated low completion accuracy for atypical inputs with unclear intentions.
  • The study aims to address the complexity barrier of Kubernetes for developers unfamiliar with the system.
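
A minimal sketch of the Compose-to-Kubernetes synthesis set-up the benchmark evaluates, assuming a generic `llm` callable; the parse check is only a crude sanity filter, whereas the paper scores generated manifests against reference manifests.

```python
# Sketch of the Compose -> Kubernetes synthesis step (prompt is illustrative).
from typing import Callable, List

import yaml  # pip install pyyaml

def compose_to_manifests(llm: Callable[[str], str], compose_yaml: str) -> List[dict]:
    prompt = (
        "Convert this Docker Compose specification into Kubernetes manifests "
        "(Deployments and Services). Output only YAML documents separated by '---'.\n\n"
        + compose_yaml
    )
    raw = llm(prompt)
    # Minimal sanity check: every emitted document must parse as YAML and carry
    # the basic Kubernetes object fields.
    docs = [d for d in yaml.safe_load_all(raw) if d]
    for doc in docs:
        assert "kind" in doc and "apiVersion" in doc, "not a Kubernetes object"
    return docs
```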

Source: Migrating Existing Container Workload to Kubernetes -- LLM Based Approach and Evaluation

CodeJudge-Eval: A Benchmark for LLMs' Code Understanding

CodeJudge-Eval (CJ-Eval) is a new benchmark designed to assess large language models' (LLMs) code understanding abilities through code judging rather than code generation.

  • The benchmark challenges models to determine the correctness of provided code solutions, including various error types and compilation issues.
  • CJ-Eval addresses limitations of traditional benchmarks, such as potential memorization of solutions, by using a diverse set of problems and a fine-grained judging system.
  • Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle with code understanding tasks.
  • The benchmark will be available on GitHub, providing a new tool for researchers to assess and improve LLMs' code comprehension capabilities.
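
A minimal sketch of the judging protocol, assuming a simplified verdict set; CJ-Eval's own judging taxonomy and fine-grained scoring are defined in the paper.

```python
# Sketch of code judging: the model sees a problem plus a candidate solution
# and must label it. The label set and prompt are assumptions.
from typing import Callable, Dict, List

LABELS = ["accepted", "wrong_answer", "runtime_error", "compile_error"]

def judge_accuracy(llm: Callable[[str], str], items: List[Dict]) -> float:
    """items: [{'problem': str, 'solution': str, 'verdict': str}, ...]"""
    correct = 0
    for item in items:
        prompt = (
            f"Problem:\n{item['problem']}\n\n"
            f"Candidate solution:\n{item['solution']}\n\n"
            f"Answer with exactly one of: {', '.join(LABELS)}."
        )
        prediction = llm(prompt).strip().lower()
        correct += int(prediction == item["verdict"])
    return correct / len(items)
```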

Source: CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

SWE-bench-java: Java-focused GitHub Issue Resolution Benchmark

SWE-bench-java is a benchmark for evaluating LLMs' capabilities in resolving GitHub issues for Java projects, expanding on the original Python-focused SWE-bench.

  • The benchmark includes a publicly available dataset, Docker-based evaluation environment, and leaderboard.
  • Reliability of SWE-bench-java was verified by implementing SWE-agent and testing several powerful LLMs.
  • Continuous maintenance and updates are planned for the coming months to improve the benchmark.
  • The project aims to support multilingual issue resolution, addressing industry demand for expanded language coverage.
  • Contributions and collaborations are welcomed to accelerate the benchmark's development and refinement.
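
A rough sketch of what a Docker-based check for a single task instance can look like; the image name, mount path, and test command are hypothetical, since the real harness pins these per instance in the dataset.

```python
# Sketch: apply a model-generated patch inside a task container and run the
# project's tests. Not the actual SWE-bench-java evaluation harness.
import subprocess
import tempfile

def evaluate_patch(image: str, patch: str, test_cmd: str = "mvn -q test") -> bool:
    with tempfile.NamedTemporaryFile("w", suffix=".diff", delete=False) as f:
        f.write(patch)
        patch_file = f.name
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{patch_file}:/tmp/fix.diff",
        image,
        "bash", "-c", f"git apply /tmp/fix.diff && {test_cmd}",
    ]
    # Resolution counts as successful only if the tests pass after the patch.
    return subprocess.run(cmd).returncode == 0
```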

Source: SWE-bench-java: A GitHub Issue Resolving Benchmark for Java

Vulnerability Handling in AI-Generated Code: Solutions and Challenges

A study examining the current state of LLM-based approaches for handling vulnerabilities in AI-generated code, focusing on detection, localization, and repair methods.

  • The increasing use of LLMs for code generation in software development has led to improved productivity but also introduced security vulnerabilities.
  • Traditional vulnerability handling processes, which often rely on manual review, are challenging to apply to AI-generated code due to the potential for multiple, slightly varied vulnerabilities.
  • The paper explores recent progress in LLM-based approaches for vulnerability handling in AI-generated code.
  • Open challenges in establishing reliable and scalable vulnerability handling processes for AI-generated code are highlighted.
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Vulnerability Handling of AI-Generated Code -- Existing Solutions and Open Challenges

AgoneTest: Automated Unit Test Generation and Evaluation System for Java Projects

AgoneTest is a system that automates the generation and evaluation of unit test suites for Java projects using LLMs, focusing on class-level test code generation.

  • The system addresses limitations in previous LLM-based unit test generation studies, which often focused on simple, small-scale scenarios.
  • AgoneTest generates more complex, real-world test suites and automates the entire process from test generation to assessment.
  • A new dataset, built upon the Methods2Test dataset, allows comparison between human-written and LLM-generated tests.
  • The system includes a comprehensive methodology for evaluating test quality, enabling scalable assessment of generated test suites.
  • AgoneTest aims to reduce the cost and labor-intensive nature of unit test creation in software development.
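
A minimal sketch of a generate-then-assess loop of the kind AgoneTest automates; the prompt, file layout, and Maven invocation are assumptions, and the paper's quality assessment goes well beyond a passing build.

```python
# Sketch: ask an LLM for a class-level JUnit test, drop it into the project,
# and run the build to assess it. Coverage/mutation tooling can be attached
# to the same build; AgoneTest's actual pipeline and metrics differ.
import pathlib
import subprocess
from typing import Callable

def generate_and_assess(
    llm: Callable[[str], str], project_dir: str, class_source: str, test_path: str
) -> bool:
    test_code = llm(
        "Write a JUnit 5 test class that covers the public methods of this "
        "Java class. Output only Java code.\n\n" + class_source
    )
    pathlib.Path(project_dir, test_path).write_text(test_code)
    result = subprocess.run(["mvn", "-q", "test"], cwd=project_dir)
    return result.returncode == 0
```
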
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: A System for Automated Unit Test Generation Using Large Language Models and Assessment of Generated Test Suites

LLM-Based Quality Assessment of Software Requirements

A study exploring the use of Large Language Models (LLMs) to evaluate and improve software requirements according to ISO 29148 standards.

  • The research introduces an LLM-based approach for assessing quality characteristics of software requirements, aiming to support stakeholders in requirements engineering.
  • The LLM demonstrates capabilities in evaluating requirements, explaining its decision-making process, and proposing improved versions of requirements.
  • A validation study conducted with software engineers emphasizes the potential of LLMs in enhancing the quality of software requirements.
  • This approach could significantly reduce development costs and improve overall software quality by ensuring high-quality requirements from the outset.
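
A minimal sketch of a single assessment call, assuming a small subset of ISO/IEC/IEEE 29148 quality characteristics and an illustrative JSON schema; the paper's prompts and evaluation protocol differ.

```python
# Sketch: rate one requirement against a few ISO/IEC/IEEE 29148 characteristics
# and propose an improved version. The characteristic list is a subset chosen
# for illustration.
import json
from typing import Callable, Dict

CHARACTERISTICS = ["unambiguous", "verifiable", "complete", "singular", "feasible"]

def assess_requirement(llm: Callable[[str], str], requirement: str) -> Dict:
    prompt = (
        "Assess this software requirement against the characteristics "
        f"{CHARACTERISTICS}. For each, answer yes/no with a one-sentence reason, "
        "then propose an improved version. Respond as JSON with keys "
        "'ratings' and 'improved_requirement'.\n\nRequirement: " + requirement
    )
    return json.loads(llm(prompt))
```
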
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Leveraging LLMs for the Quality Assurance of Software Requirements

Java Method Summarization: Comparing Lightweight Approaches to ASAP

A study comparing simple, lightweight approaches for automatically generating Java method summaries to the more complex Automatic Semantic Augmentation of Prompts (ASAP) method.

  • Four lightweight approaches were evaluated against ASAP, using only the method body as input without requiring static program analysis or exemplars.
  • Experiments were conducted on an Ericsson software project and replicated with open-source projects Guava and Elasticsearch.
  • Performance was measured across eight similarity metrics, with one lightweight approach performing as well as or better than ASAP in both Ericsson and open-source projects.
  • An ablation study revealed that the proposed approaches were less influenced by method names compared to ASAP, suggesting more comprehensive derivation from the method body.
  • The findings indicate potential for rapid deployment of lightweight summarization techniques in commercial software development environments.
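
A minimal sketch of the "method body only" prompting style the lightweight approaches share, in contrast to ASAP, which augments the prompt with static-analysis facts and retrieved exemplars; the wording is illustrative, not the paper's prompt.

```python
# Sketch of a lightweight summarization prompt: only the method body is given,
# with no static program analysis and no exemplars.
from typing import Callable

def summarize_method(llm: Callable[[str], str], method_body: str) -> str:
    return llm(
        "Write a one-sentence Javadoc summary for this Java method. "
        "Describe what it does, not how.\n\n" + method_body
    )
```
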
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Icing on the Cake: Automatic Code Summarization at Ericsson

Enhancing Code Maintainability in LLM-Generated Python

A study focusing on improving the maintainability of Python code generated by LLMs through fine-tuning and specialized datasets.

  • The research addresses the growing concern of code maintainability in LLM-generated output, an aspect often overlooked in favor of functional accuracy and testing success.
  • A specially designed dataset was created for training and evaluating the model, ensuring a comprehensive assessment of code maintainability.
  • The core of the study involves fine-tuning an LLM for code refactoring, aiming to enhance readability, reduce complexity, and improve overall maintainability.
  • Evaluation results indicate significant improvements in code maintainability standards, suggesting a promising direction for AI-assisted software development.
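
A minimal sketch of how a refactoring fine-tuning corpus can be laid out as chat-format JSONL; the instruction text and field names are assumptions rather than the paper's dataset format.

```python
# Sketch: write (original, refactored) Python pairs as chat-style JSONL, the
# layout expected by OpenAI-style fine-tuning endpoints.
import json
from typing import Iterable, Tuple

def write_finetune_file(pairs: Iterable[Tuple[str, str]], path: str) -> None:
    """pairs: (original_python, refactored_python) training examples."""
    with open(path, "w") as f:
        for original, refactored in pairs:
            record = {
                "messages": [
                    {"role": "system",
                     "content": "Refactor Python code for readability and maintainability."},
                    {"role": "user", "content": original},
                    {"role": "assistant", "content": refactored},
                ]
            }
            f.write(json.dumps(record) + "\n")
```
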
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Better Python Programming for all: With the focus on Maintainability

Generative LLMs in Requirements Engineering: Potential and Challenges

A discussion on how generative LLMs like GPT could transform Requirements Engineering (RE) by automating various tasks, emphasizing the importance of precise prompts for effective interactions.

  • LLMs have the potential to revolutionize RE processes through automation of tasks.
  • Precise prompts are crucial for effective interactions with LLMs in RE contexts.
  • Human evaluation remains essential in leveraging LLM capabilities for RE.
  • Prompt engineering is a key skill for maximizing the benefits of LLMs in RE workflows.
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: From Specifications to Prompts: On the Future of Generative LLMs in Requirements Engineering

ChatGPT App Ecosystem: Distribution, Deployment, and Security Analysis

A comprehensive study of the ChatGPT app ecosystem, examining distribution, deployment models, and security implications of third-party plugins.

  • The study analyzes the integration of LLMs with third-party apps, focusing on ChatGPT plugins distributed through OpenAI's plugin store.
  • Findings reveal an uneven distribution of functionality among ChatGPT plugins, with certain topics being more prevalent than others.
  • Severe flaws in authentication and user data protection were identified in third-party app APIs integrated with LLMs, raising concerns about security and privacy in the ecosystem.
  • The research aims to provide insights for secure and sustainable development of this rapidly evolving ecosystem, addressing potential barriers to broader adoption by developers and users.
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Exploring ChatGPT App Ecosystem: Distribution, Deployment and Security

LLM-Generated Code Documentation: A Quantitative and Qualitative Study

A study evaluating the use of OpenAI GPT-3.5 for generating Javadoc documentation, comparing AI-generated comments with original human-written ones through both quantitative and qualitative assessments.

  • The research utilized GPT-3.5 to regenerate Javadoc for 23,850 code snippets, including methods and classes.
  • Qualitative analysis showed 69.7% of AI-generated comments were equivalent (45.7%) or required minor changes to be equivalent (24.0%) to the original documentation.
  • 22.4% of GPT-generated comments were rated as superior in quality compared to the original human-written documentation.
  • The study revealed inconsistencies in quantitative metrics such as BLEU for assessing comment quality: some AI-generated comments perceived as higher quality were unfairly penalized by their BLEU scores (see the sketch after this list).
  • Findings suggest LLMs could potentially automate and improve code documentation, easing the burden on developers while maintaining or enhancing quality.
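
A minimal sketch of the kind of BLEU comparison whose limitations the study points out: a faithful paraphrase of a Javadoc comment can score poorly on n-gram overlap. Uses NLTK's `sentence_bleu`.

```python
# Sketch: n-gram similarity between an original and a generated comment.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu  # pip install nltk

def comment_bleu(original: str, generated: str) -> float:
    reference = [original.lower().split()]
    candidate = generated.lower().split()
    return sentence_bleu(reference, candidate,
                         smoothing_function=SmoothingFunction().method1)

# A faithful paraphrase scores low even though a human reviewer would likely
# rate it as equivalent to the original Javadoc.
print(comment_bleu("Returns the number of elements in this list.",
                   "Gets how many items the list currently holds."))
```
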
Tools you can use from the paper:
No implementation tools or repository links are provided.

Source: Using Large Language Models to Document Code: A First Quantitative and Qualitative Assessment