
[AI Dev Tools] Knowledge Graphs, AI Compilers, Fuzzing, and Test Case Automation...


CntxtPY: Python Codebase Context Minification for LLMs

CntxtPY is a tool that generates comprehensive knowledge graphs of Python codebases, optimizing them for LLM context windows with up to 75% token reduction.

Key Features:
  • Performs deep analysis of Python codebases, generating detailed knowledge graphs of module relationships, class hierarchies, function signatures, and more.
  • Optimizes codebase context for LLM consumption, enhancing precision and eliminating noise for better insights and recommendations.
  • Generates both JSON and highly compressed text versions of the knowledge graph, along with optional visualizations.
  • Supports modern Python frameworks and patterns, including dependency injection, decorators, and API structures.
  • Gives LLMs a compact, structured view of complex Python projects that they can analyze efficiently.
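As a rough illustration of the kind of analysis such a knowledge graph is built from (this is not CntxtPY's actual API), a minimal module-level import graph can be extracted with the standard library alone:

    import ast
    import json
    from pathlib import Path

    def build_import_graph(root: str) -> dict:
        """Map each .py file (relative path) to the modules it imports."""
        graph = {}
        for path in Path(root).rglob("*.py"):
            try:
                tree = ast.parse(path.read_text(encoding="utf-8"))
            except SyntaxError:
                continue  # skip files that do not parse
            imports = set()
            for node in ast.walk(tree):
                if isinstance(node, ast.Import):
                    imports.update(alias.name for alias in node.names)
                elif isinstance(node, ast.ImportFrom) and node.module:
                    imports.add(node.module)
            graph[str(path.relative_to(root))] = sorted(imports)
        return graph

    if __name__ == "__main__":
        # Emit a JSON fragment in the spirit of CntxtPY's knowledge-graph output.
        print(json.dumps(build_import_graph("."), indent=2))

CntxtPY itself goes much further (class hierarchies, signatures, framework patterns), but the graph-of-relationships idea is the same.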
Source: https://github.com/brandondocusen/CntxtPY

AICC: AI-Enhanced C Compiler

AICC is an innovative C compiler that leverages AI to optimize and fix common coding issues, aiming to streamline the compilation process beyond traditional compilers like gcc and clang.

Key Features:
  • Utilizes ChatGPT to analyze and optimize C code, potentially fixing syntax errors and memory leaks automatically.
  • Simple command-line interface for easy integration into existing workflows.
  • Attempts to understand and implement the programmer's intent, potentially overcoming limitations of traditional compilers.
  • Leaves some compilation outcomes to the AI, which decides aspects such as performance trade-offs and memory management.
  • Can be built using traditional methods or with Nix, requiring LLVM and curl as dependencies.
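A hedged sketch of the overall idea, not AICC's implementation (which builds against LLVM): send the C source to an LLM for repair, then compile the returned code conventionally. The model name is a placeholder, and the openai package plus an OPENAI_API_KEY in the environment are assumed.

    import subprocess
    import sys
    from openai import OpenAI

    def ai_compile(c_file: str, out_binary: str = "a.out") -> None:
        source = open(c_file, encoding="utf-8").read()
        client = OpenAI()
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model choice
            messages=[
                {"role": "system",
                 "content": "Fix syntax errors and obvious memory leaks in this C code. "
                            "Reply with the corrected code only, no markdown."},
                {"role": "user", "content": source},
            ],
        )
        fixed_path = c_file + ".fixed.c"
        with open(fixed_path, "w", encoding="utf-8") as f:
            f.write(resp.choices[0].message.content)
        # The LLM output is not guaranteed to be valid C; clang remains the arbiter.
        subprocess.run(["clang", "-O2", fixed_path, "-o", out_binary], check=True)

    if __name__ == "__main__":
        ai_compile(sys.argv[1])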
Source: https://github.com/cheyao/aicc

Brainstorm: AI-Enhanced Web Fuzzing Tool

Brainstorm combines traditional web fuzzing with local LLMs to optimize directory and file discovery in web applications.

Key Features:
  • Integrates ffuf for fuzzing with Ollama-powered local LLMs for intelligent path generation.
  • Extracts initial links, uses LLMs to suggest potential paths, and learns from discoveries to refine suggestions.
  • Includes two specialized tools: a main fuzzer for general path discovery and a variant for short filename discovery.
  • Supports customizable fuzzing cycles, LLM models, and success status codes.
  • Provides real-time console output and saves discovered paths and filenames.
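The loop can be sketched roughly as follows (not Brainstorm's actual code, which drives ffuf; plain HTTP probes stand in here, and the model name is a placeholder). A local Ollama server at its default endpoint is assumed.

    import requests

    OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

    def suggest_paths(model: str, known: list, n: int = 20) -> list:
        prompt = ("These paths exist on a web application:\n" + "\n".join(known)
                  + f"\nSuggest {n} more likely paths, one per line, no commentary.")
        resp = requests.post(OLLAMA_URL, timeout=120,
                             json={"model": model, "prompt": prompt, "stream": False})
        return [line.strip() for line in resp.json()["response"].splitlines() if line.strip()]

    def fuzz(target: str, model: str = "llama3.1", cycles: int = 3) -> list:
        found = ["/index.html", "/login"]        # seed with initially extracted links
        for _ in range(cycles):                  # each cycle learns from prior hits
            for path in suggest_paths(model, found):
                url = target.rstrip("/") + "/" + path.lstrip("/")
                r = requests.get(url, timeout=10)
                if r.status_code in (200, 301, 302, 403):  # configurable success codes
                    found.append(path)
                    print("hit:", path, r.status_code)
        return found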
Source: https://github.com/Invicti-Security/brainstorm

oneShotCodeGen: AI-Powered Full Stack Web Application Generator

oneShotCodeGen is a command-line tool that generates complete full stack web applications from a single prompt using LLMs. It tackles the challenge of generating accurate web apps from scratch by dividing the process into distinct steps and documenting its assumptions.

Key Features:
  • Generates functional requirements, technical specifications, database setup, backend, and frontend code from a single prompt.
  • Supports multiple prompt versions and LLM providers (OpenAI and Anthropic) for flexibility and optimization.
  • Offers various code generation strategies, including full project creation and customizable prompt chains.
  • Produces organized project structures with documentation in a separate "/generated_project" folder.
  • Provides options for different technology stacks, including React, Node.js, Express, SQLite, Svelte, and Supabase.
  • Example use: Create a todo app with a single command: python -m ai_code_generator_cli.cli "Create a todo app"
Source: https://github.com/vivek100/oneShotCodeGen

UICloner: AI-Powered UI Component Cloning Browser Extension

UICloner is a browser extension that uses LLMs to clone UI components from any webpage and automatically generate corresponding code implementations.

Key Features:
  • One-click selection of UI components on any website for cloning.
  • Generates HTML with either Tailwind CSS or pure CSS code.
  • Real-time preview of cloned UI and generated code.
  • Supports integration with GPT-4 or Claude 3.5 for code generation (API key required).
  • Easy installation from the Chrome Web Store and simple setup process.
  • Built using modern web technologies like React, Tailwind CSS, and TypeScript.
Source: https://github.com/AndySpider/uicloner-extension

git2txt: GitHub Repository to Text File Converter

git2txt is a CLI tool that downloads GitHub repositories and combines their contents into a single text file, facilitating analysis, documentation, or AI training data preparation.

Key Features:
  • Downloads public GitHub repositories and converts them to a single text file, with automatic binary file exclusion and configurable file size thresholds.
  • Supports various repository URL formats including HTTPS, SSH, and short format (username/repository).
  • Offers customizable output options, including specifying the output file path and adjusting file size thresholds for inclusion.
  • Preserves relative file paths and separates file contents with clear markers in the output text file.
  • Cross-platform compatibility for Windows, macOS, and Linux.
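The core transformation is easy to picture; a minimal sketch of the same idea (not git2txt's own implementation) looks like this:

    import subprocess
    from pathlib import Path

    def repo_to_txt(repo_url: str, out_file: str = "repo.txt",
                    max_bytes: int = 500_000) -> None:
        workdir = Path("repo_clone")
        subprocess.run(["git", "clone", "--depth", "1", repo_url, str(workdir)], check=True)
        with open(out_file, "w", encoding="utf-8") as out:
            for path in sorted(workdir.rglob("*")):
                if not path.is_file() or ".git" in path.parts:
                    continue
                if path.stat().st_size > max_bytes:      # configurable size threshold
                    continue
                try:
                    text = path.read_text(encoding="utf-8")
                except UnicodeDecodeError:
                    continue                             # crude binary-file exclusion
                out.write(f"\n===== {path.relative_to(workdir)} =====\n{text}\n")

    # Example: repo_to_txt("https://github.com/addyosmani/git2txt")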
Source: https://github.com/addyosmani/git2txt

Cali: AI Agent for Building React Native Apps

Cali is an AI agent that assists in developing React Native applications by providing LLM access to React Native CLI utilities and functions.

Key Features:
  • Operates as a standalone CLI tool, integrates with Vercel AI SDK, or runs as an MCP server for compatibility with Claude and other environments.
  • Automates build processes for iOS and Android, manages devices and simulators, and handles dependency installation.
  • Searches and lists React Native libraries from the React Native Directory.
  • Aims to streamline development by reducing the need to memorize commands and troubleshoot errors.
  • Example use: npx cali to start the AI agent in the terminal.
Source: https://github.com/callstackincubator/cali

FullStack Bench: Comprehensive Code Evaluation Dataset for LLMs

FullStack Bench is a new evaluation dataset designed to assess the full-stack programming capabilities of large language models (LLMs) across diverse application domains and programming languages.

  • The dataset covers a wide range of domains, including basic programming, data analysis, software engineering, mathematics, and machine learning.
  • FullStack Bench features real-world instructions and unit test cases in 16 widely-used programming languages, reflecting authentic usage scenarios rather than simple translations.
  • SandboxFusion, an efficient code sandbox execution tool, accompanies the dataset to support various programming languages and packages for performance evaluation.
  • Experimental results demonstrate the necessity and effectiveness of FullStack Bench in assessing LLMs' full-stack coding capabilities.

Source: FullStack Bench: Evaluating LLMs as Full Stack Coders

VeCoGen: Formally Verified C Code Generation Using LLMs

VeCoGen is a tool that combines LLMs with formal verification to generate C programs that are guaranteed to meet specified requirements.

  • The tool takes three inputs: a formal specification in ANSI/ISO C Specification Language (ACSL), a natural language specification, and a set of test cases.
  • VeCoGen's two-step process involves generating initial candidate programs and iteratively improving them until a program meets the formal specification.
  • Evaluation on 15 Codeforces competition problems showed VeCoGen successfully solved 13 of them.
  • This approach demonstrates the potential for automating the creation of verified programs, particularly useful for safety-critical applications.
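A heavily hedged sketch of such a generate-then-verify loop (not VeCoGen's code: the test-case step is omitted, generate_candidate stands in for the LLM call, and the Frama-C/WP success check is simplified):

    import subprocess

    def wp_verified(c_file: str) -> bool:
        # Run Frama-C's WP plugin on the ACSL-annotated candidate; this success
        # check is simplified -- real use would parse WP's goal report.
        result = subprocess.run(["frama-c", "-wp", c_file],
                                capture_output=True, text=True)
        return result.returncode == 0 and "Proved goals" in result.stdout

    def generate_verified_program(acsl_spec: str, nl_spec: str,
                                  generate_candidate, max_rounds: int = 10):
        feedback = ""
        for _ in range(max_rounds):
            candidate = generate_candidate(acsl_spec, nl_spec, feedback)  # LLM-backed
            with open("candidate.c", "w", encoding="utf-8") as f:
                f.write(candidate)
            if wp_verified("candidate.c"):
                return candidate                  # meets the formal specification
            feedback = "Verification failed; revise the implementation."
        return None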

Source: VeCoGen: Automating Generation of Formally Verified C Code with Large Language Models

ChatGPT for System Test Case Design: Effectiveness and Challenges

A study exploring the use of Large Language Models (LLMs) to generate system test case designs from Software Requirements Specification (SRS) documents, with a focus on the ChatGPT-4 Turbo model's performance.

  • The research utilized SRS documents from five software engineering projects, containing both functional and non-functional requirements.
  • A prompt-chaining technique was employed, starting with a context-setting prompt followed by prompts for each use case.
  • 87% of the generated test cases were deemed valid by developer teams, with 15% of these being previously unconsidered test scenarios.
  • ChatGPT was also tasked with identifying redundant test cases, which were then validated by developers to uncover false positives and missed redundancies.
  • The study highlights LLMs' potential in enhancing test suite quality and efficiency, while also addressing the challenges of comprehensive test case creation from requirements.
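An illustrative version of that prompt chain (not the study's exact prompts; assumes the openai package and an API key in the environment):

    from openai import OpenAI

    client = OpenAI()

    def design_test_cases(srs_text: str, use_cases: list, model: str = "gpt-4-turbo") -> dict:
        # Context-setting prompt first, then one chained prompt per use case.
        messages = [
            {"role": "system", "content": "You design system test cases from requirements."},
            {"role": "user", "content": f"Here is the SRS document:\n{srs_text}"},
        ]
        designs = {}
        for uc in use_cases:
            messages.append({"role": "user",
                             "content": f"Design system test cases for use case '{uc}'. "
                                        "Include preconditions, steps, and expected results."})
            resp = client.chat.completions.create(model=model, messages=messages)
            answer = resp.choices[0].message.content
            messages.append({"role": "assistant", "content": answer})  # keep the chain
            designs[uc] = answer
        return designs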

Source: System Test Case Design from Requirements Specifications: Insights and Challenges of Using ChatGPT

TDD-Bench Verified: A Benchmark for LLM-Generated Test-Driven Development

TDD-Bench Verified is a benchmark suite of 449 real-world GitHub issues, designed to evaluate LLMs' ability to generate tests for test-driven development (TDD) before issue resolution.

  • The benchmark focuses on fail-to-pass tests, which should fail before issue resolution and pass afterward, while providing good code coverage.
  • An evaluation harness runs relevant tests in isolation for accurate coverage measurements, with the dataset filtered by both human judges and execution in the harness.
  • Auto-TDD, an LLM-based solution, generates tests based on issue descriptions and pre-resolution codebases to validate changes made during issue resolution.
  • Evaluation shows Auto-TDD outperforms prior work in fail-to-pass rates while maintaining high coverage adequacy.
  • The project aims to enhance developer productivity and robustness of issue resolutions through automated TDD test generation.
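A simplified sketch of what a fail-to-pass check looks like (not the benchmark's actual harness, which additionally isolates tests and measures coverage):

    import subprocess

    def run_test(repo_dir: str, test_path: str) -> bool:
        return subprocess.run(["python", "-m", "pytest", test_path, "-q"],
                              cwd=repo_dir).returncode == 0

    def fail_to_pass(repo_dir: str, test_path: str, fix_patch: str) -> bool:
        # The generated test must fail on the pre-resolution code...
        failed_before = not run_test(repo_dir, test_path)
        # ...and pass once the issue-resolving patch is applied.
        subprocess.run(["git", "-C", repo_dir, "apply", fix_patch], check=True)
        passed_after = run_test(repo_dir, test_path)
        return failed_before and passed_after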

Source: TDD-Bench Verified: Can LLMs Generate Tests for Issues Before They Get Resolved?

PandasPlotBench: Evaluating LLMs in Generating Visualization Code

PandasPlotBench is a benchmark dataset for assessing LLMs' ability to generate plotting code from natural language instructions for tabular data visualization.

  • The dataset comprises 175 unique tasks, focusing on code generation for Matplotlib, Seaborn, and Plotly libraries.
  • Experiments reveal that shortening tasks has minimal impact on plotting capabilities, allowing for concise user input without sacrificing functionality.
  • LLMs perform well with Matplotlib and Seaborn but face challenges with Plotly, indicating areas for improvement.
  • The benchmark's modular design aims to expand current studies on visualization generation.
  • PandasPlotBench is available on Hugging Face, with the code for running the benchmark accessible on GitHub.

Source: Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code

HumanEval_T: A Benchmark for LLM Program Generation Using Combinatorial Test Design

A new benchmark construction method addresses data leakage in LLM evaluation for program generation tasks. It uses template tasks and combinatorial test design to create HumanEval_T, an alternative to the HumanEval benchmark.

  • Data leakage in benchmarks like HumanEval can lead to unfair evaluations of LLMs in software engineering tasks.
  • The proposed method creates template tasks that can be instantiated into concrete tasks, balancing uniqueness and similarity.
  • Concrete tasks are designed to be different enough to minimize the impact of data leakage, yet similar enough for consistent performance evaluation.
  • HumanEval_T serves as an example of this new approach, offering a potentially more robust alternative to the original HumanEval benchmark.
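A toy illustration of template instantiation (the template below is hypothetical, not from the paper; full combinatorial test design would select a covering subset of combinations rather than enumerating them all):

    from itertools import product

    TEMPLATE = ("Write a function that returns the {agg} of all {parity} "
                "numbers in a list of integers.")
    PARAMS = {"agg": ["sum", "product", "count"], "parity": ["even", "odd"]}

    def instantiate(template: str, params: dict) -> list:
        keys = list(params)
        return [template.format(**dict(zip(keys, combo)))
                for combo in product(*(params[k] for k in keys))]

    for task in instantiate(TEMPLATE, PARAMS):
        print(task)   # six concrete tasks from one template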

Source: Addressing Data Leakage in HumanEval Using Combinatorial Test Design

LogicFL: Logical Fault Localization for Null Pointer Exceptions

LogicFL is a novel fault localization technique that uses logic programming to identify causes of Null Pointer Exceptions (NPEs) in software.

  • The technique imitates human developers' deduction process, using logical inferences on collected facts about faulty code and test execution.
  • In an empirical evaluation of 76 NPE bugs, LogicFL accurately identified fault locations and pinpointed exact code fragments for 88.16% of bugs, outperforming two compared LLM-based techniques.
  • LogicFL can be executed on a typical laptop, with an average runtime of 21.63 seconds and a worst-case time under two minutes.
  • Compared to LLM-based techniques using GPT-4, LogicFL proved significantly more cost-efficient, running at 343.94 to 3,736.19 times lower cost.
  • The deduction process in LogicFL is fully traceable, allowing for better understanding of reasoning behind outcomes and further technique enhancement.

Source: Identifying Root Causes of Null Pointer Exceptions with Logical Inferences

Energy and Accuracy Trade-offs in LLMs for Software Development

A study exploring the balance between model accuracy and energy consumption for locally-deployed language models in software development tasks.

  • The research examined 18 LLM families on two real-world infrastructures: a commodity GPU and a powerful AI-specific GPU.
  • Both full-precision and quantized models were considered, addressing the need for powerful infrastructure when deploying LLMs locally.
  • Results showed that using larger LLMs with higher energy budgets doesn't always lead to significantly improved accuracy.
  • Quantized versions of large models generally outperformed full-precision versions of medium-sized models in terms of efficiency and accuracy.
  • No single model proved suitable for all types of software development tasks, highlighting the importance of task-specific model selection.

Source: Analyzing the Energy and Accuracy of LLMs in Software Development

C2HLSC: Automated C-to-HLS Code Conversion Using LLMs

A framework that leverages Large Language Models (LLMs) to automatically refactor C code into High-Level Synthesis (HLS) compatible formats, bridging the gap between software and hardware design.

  • HLS tools enable rapid hardware design from C code, but their compatibility is limited by certain code constructs. C2HLSC addresses this limitation by using LLMs to rewrite C code into HLS-synthesizable formats.
  • The framework employs an iterative approach, guiding the LLM to transform C code by implementing functions like streaming data and hardware-specific signals.
  • A preprocessing step breaks down complex designs, allowing a divide-and-conquer, bottom-up approach for handling intricate code structures.
  • Validation tests on various algorithms, including ciphers, hash functions, and randomness tests, demonstrated a high success rate with benchmarks significantly more complex than previous Verilog generation attempts using LLMs.

Source: C2HLSC: Leveraging Large Language Models to Bridge the Software-to-Hardware Design Gap

OpenAPI Chunking for Retrieval-Augmented Generation in System Integration

A study analyzing the effectiveness of Retrieval Augmented Generation (RAG) for endpoint discovery in system integration, focusing on OpenAPI chunking methods to optimize input for Large Language Models (LLMs).

  • RAG is used for endpoint discovery to reduce input token length while preserving relevant information from API descriptions.
  • The research introduces a Discovery Agent that works with summaries of relevant endpoints, retrieving details on demand to further reduce input token length and improve retrieval.
  • Evaluation using the RestBench benchmark shows high recall, precision, and F1 scores for endpoint retrieval, though further research is needed to ensure all necessary endpoints are retrieved.
  • LLM-based and format-specific approaches for preprocessing outperform naïve chunking methods in the experiments.
  • The agent-based approach improves overall RAG performance by splitting tasks into fine-grained subtasks, enhancing token count, precision, and F1 score.
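A minimal example of format-specific chunking, one chunk per endpoint (field names follow the OpenAPI 3.x structure; the embedding/retrieval side and the Discovery Agent are omitted):

    import json

    HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}

    def chunk_openapi(spec: dict) -> list:
        chunks = []
        for path, item in spec.get("paths", {}).items():
            for method, op in item.items():
                if method not in HTTP_METHODS:
                    continue                       # skip path-level parameters etc.
                chunks.append({
                    "id": f"{method.upper()} {path}",
                    "summary": op.get("summary", ""),
                    "description": op.get("description", ""),
                    "parameters": [p.get("name") for p in op.get("parameters", [])],
                })
        return chunks

    with open("openapi.json", encoding="utf-8") as f:
        for chunk in chunk_openapi(json.load(f)):
            print(chunk["id"], "-", chunk["summary"])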

Source: Advanced System Integration: Analyzing OpenAPI Chunking for Retrieval-Augmented Generation

VULTURE: Detecting 1-Day Vulnerabilities in Third-Party Library Reuse

VULTURE is a tool designed to identify 1-day vulnerabilities arising from the reuse of vulnerable third-party libraries (TPLs) in software development.

  • The tool addresses the security risks associated with TPL reuse, which can introduce vulnerabilities due to low maintenance and delayed updates.
  • VULTURE employs TPLFILTER, a database creation method utilizing Large Language Models (LLMs) to build a platform-specific database automatically.
  • Instead of code-level similarity comparison, the tool uses hashing-based comparison to explore dependencies among TPLs and identify similarities between TPLs and target projects.
  • To accommodate both exact and custom TPL reuse, VULTURE performs version-based comparison and chunk-based analysis, capturing fine-grained semantic features at the function level.
  • In a test on 10 real-world projects, VULTURE successfully identified 175 vulnerabilities from 178 reused TPLs, demonstrating its effectiveness in enhancing security in third-party library reuse.

Source: Enhancing Security in Third-Party Library Reuse -- Comprehensive Detection of 1-day Vulnerability through Code Patch Analysis

Action Engine: Automated FaaS Workflow Generation Using LLMs

A framework that leverages Tool-Augmented Large Language Models (LLMs) to automate Function as a Service (FaaS) workflow generation, addressing challenges faced by cloud-native application developers.

  • Action Engine interprets human language queries to create FaaS workflows, reducing the need for specialized expertise and manual design.
  • The system includes modules for identifying relevant functions from FaaS repositories and managing data dependencies between them.
  • Capable of executing generated workflows using user-provided parameters, streamlining the development process.
  • Evaluation results show up to 20% higher correctness in workflow generation compared to manual developer involvement.
  • Action Engine aims to make FaaS workflow creation accessible to non-cloud-savvy developers and accelerate cloud-native application development cycles.

Source: Action Engine: An LLM-based Framework for Automatic FaaS Workflow Generation

SeedMind: LLM-Powered Seed Generation for Greybox Fuzzing

SeedMind is a system that utilizes LLMs to enhance greybox fuzzing through intelligent seed generation, creating test case generators rather than direct test cases.

  • The system addresses the critical need for high-quality initial seeds in greybox fuzzing, especially for programs with non-standard or custom input formats.
  • SeedMind implements an iterative, feedback-driven process to guide LLMs in refining test case generation, aiming to increase code coverage depth and breadth.
  • Key challenges addressed include input format limitations, context window constraints, and ensuring consistent, progress-aware behavior.
  • Evaluations with real-world applications demonstrate SeedMind's effectiveness in generating high-quality test cases, showing utility comparable to human-created seeds and outperforming existing LLM-based solutions.
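A conceptual sketch of the generator-not-seeds idea (not SeedMind's implementation; ask_llm and measure_coverage are hypothetical stand-ins for the LLM call and the coverage feedback signal):

    import subprocess
    from pathlib import Path

    def refine_generator(ask_llm, measure_coverage, target: str, rounds: int = 5) -> str:
        prompt = "Write a Python script that writes 50 seed files into ./seeds/."
        best_script, best_cov = "", 0.0
        for _ in range(rounds):
            script = ask_llm(prompt)                     # LLM returns a generator, not seeds
            Path("gen.py").write_text(script, encoding="utf-8")
            Path("seeds").mkdir(exist_ok=True)
            subprocess.run(["python", "gen.py"], check=False)
            cov = measure_coverage(target, "seeds")      # e.g. a fuzzer dry run
            if cov > best_cov:
                best_script, best_cov = script, cov
            prompt = (f"Your last generator reached {cov:.1%} coverage. "
                      "Revise it to exercise deeper code paths.")
        return best_script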

Source: Harnessing Large Language Models for Seed Generation in Greybox Fuzzing

FAUN-Eval: A Benchmark for LLMs' Fine-Grained Issue Solving in Software Development

FAUN-Eval is a benchmark designed to evaluate the fine-grained issue solving capabilities of LLMs in software development, focusing on question-answering, fault localization, and code editing tasks.

  • The benchmark addresses limitations in existing tools like HumanEval and SWE-Bench, which lack granular insights into LLMs' performance on subtasks involved in issue solving.
  • FAUN-Eval's dataset comprises 300 entries from 30 well-known GitHub repositories, with issue and pull request pairs meticulously compiled and validated.
  • Evaluation of ten LLMs (four closed-source and six open-source) revealed varying top performers across different tasks and highlighted potential challenges, such as incorrect information generation based on issue features.
  • The benchmark's findings also indicate that models may differ in their proficiency with texts of varying lengths, providing valuable insights for LLM selection and improvement in software development contexts.

Source: A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models

LLMs in Software Engineering: Challenges for Trustworthy Integration

Large Language Models (LLMs) are revolutionizing software engineering, offering potential for accelerated development, reduced complexity, and cost savings. Their integration into the software lifecycle promises to enhance design, development, deployment, and maintenance processes.

  • LLMs can drive various stages of software development, from initial design to deployment and ongoing improvement.
  • Benefits include early bug detection, continuous improvement capabilities, and faster resolution of critical issues.
  • Challenges for trustworthy LLM-driven software engineering include ensuring accuracy, scalability, and addressing potential biases.
  • Explainability of LLM-generated code and decisions remains a key concern for building trust in the development process.

Source: Engineering Trustworthy Software: A Mission for LLMs

LLMigrate: Efficient UI Test Transfer for Mobile Apps Using LLMs

A technique that leverages LLMs to transfer usage-based UI tests across mobile apps, significantly reducing the manual effort required for test creation.

  • LLMigrate achieves a 97.5% success rate in automated test transfer across Android apps.
  • The approach reduces manual effort for test creation by 91.1% compared to writing tests from scratch.
  • Performance surpasses previous methods, showing a 9.1% improvement in success rate and 38.2% in effort reduction.
  • Addresses limitations of earlier techniques, particularly in scenarios where source and target apps have significant variations.

Source: Automated Test Transfer Across Android Apps Using Large Language Models

LLMPrior: Crowdsourced Test Report Prioritization with LLMs

LLMPrior is a novel approach for prioritizing crowdsourced test reports using large language models (LLMs), aimed at improving the efficiency of review processes in software testing.

  • The method leverages LLMs to analyze and cluster test reports based on bug types revealed in their textual descriptions.
  • Prompt engineering techniques are employed to enhance LLM performance, followed by a recurrent selection algorithm for report prioritization.
  • Empirical experiments demonstrate LLMPrior's superiority over current state-of-the-art approaches in terms of performance, feasibility, efficiency, and reliability.
  • This innovative approach addresses the challenges of traditional prioritization methods, offering a more effective solution for app developers handling crowdsourced test reports.
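A toy sketch of cluster-aware prioritization (the paper's recurrent selection algorithm may differ in detail): once the LLM has grouped reports by the bug type they reveal, picking round-robin across clusters surfaces distinct bugs early.

    from itertools import zip_longest

    def prioritize(clusters: dict) -> list:
        ordered = []
        for picks in zip_longest(*clusters.values()):    # one report per cluster per round
            ordered.extend(r for r in picks if r is not None)
        return ordered

    clusters = {   # hypothetical LLM-assigned bug-type clusters
        "crash": ["report_3", "report_9"],
        "ui_glitch": ["report_1", "report_4", "report_7"],
        "performance": ["report_2"],
    }
    print(prioritize(clusters))
    # ['report_3', 'report_1', 'report_2', 'report_9', 'report_4', 'report_7']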

Source: Redefining Crowdsourced Test Report Prioritization: An Innovative Approach with Large Language Model

Assertify: Production Code Assertion Generation Using LLMs

Assertify is an automated tool that generates production assertions for code using LLMs and prompt engineering with few-shot learning.

  • Production assertions help developers validate assumptions, debug code, and enhance code comprehension.
  • The tool creates context-rich prompts to emulate how developers write assertions for their code.
  • Evaluation on a dataset of 2,810 methods from 22 mature Java repositories showed an average ROUGE-L score of 0.526.
  • Results indicate high structural similarity between generated assertions and those written by developers, demonstrating the potential of LLMs in automating this task.
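An illustrative few-shot prompt in that style (not Assertify's actual prompts or context-building pipeline; the examples are made up):

    FEW_SHOT = [
        {"method": "int divide(int a, int b) { return a / b; }",
         "assertion": 'assert b != 0 : "divisor must be non-zero";'},
        {"method": "void setAge(int age) { this.age = age; }",
         "assertion": 'assert age >= 0 : "age must be non-negative";'},
    ]

    def build_prompt(target_method: str, class_context: str) -> str:
        examples = "\n\n".join(f"Method:\n{ex['method']}\nAssertion:\n{ex['assertion']}"
                               for ex in FEW_SHOT)
        return ("Add a production assertion that validates this method's assumptions.\n\n"
                f"{examples}\n\n"
                f"Surrounding class context:\n{class_context}\n\n"
                f"Method:\n{target_method}\nAssertion:\n")

    print(build_prompt("double sqrt(double x) { return Math.sqrt(x); }",
                       "class MathUtils { /* fields and other methods */ }"))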

Source: ASSERTIFY: Utilizing Large Language Models to Generate Assertions for Production Code

LLM-Generated Documentation for Legacy Code: Challenges and Opportunities

A study exploring the use of LLMs to generate documentation for legacy code written in MUMPS and IBM mainframe Assembly Language Code (ALC).

  • The research focuses on legacy software systems written in outdated languages, which present challenges in efficiency, maintenance, staffing, and security.
  • Researchers proposed a prompting strategy for generating line-wise code comments and developed a rubric to evaluate their quality based on completeness, readability, usefulness, and hallucination.
  • LLM-generated comments for MUMPS and ALC were found to be generally hallucination-free, complete, readable, and useful when compared to ground-truth comments, though ALC posed more challenges.
  • The study revealed no strong correlation between automated metrics (such as code complexity and reference-based metrics) and comment quality, highlighting the limitations of current automated measures for evaluating LLM performance in this context.
  • Findings emphasize the need for better evaluation metrics for LLM-generated documentation in legacy systems.

Source: Leveraging LLMs for Legacy Code Modernization: Challenges and Opportunities for LLM-Generated Documentation

BugSpotter: Automated Generation of Code Debugging Exercises

BugSpotter is a tool that uses an LLM to create buggy code from problem descriptions, helping students practice debugging skills.

  • The tool generates buggy code and verifies the synthesized bugs using a test suite.
  • Students interact with BugSpotter by designing failing test cases, comparing the buggy code's output to the expected result.
  • This approach helps students enhance their debugging skills and practice reading problem specifications.
  • Classroom deployment showed that BugSpotter-generated exercises varied in difficulty and matched problem specifications well.
  • Student performance on LLM-generated exercises was comparable to manually created ones, suggesting BugSpotter's potential as an efficient learning aid.

Source: BugSpotter: Automated Generation of Code Debugging Exercises

LLM Middleware: Facilitating Enterprise Deployment and Adoption

A proposed middleware system architecture to support enterprises in deploying and adopting large language models (LLMs) independently from major cloud providers.

  • The growing popularity of LLMs has led to widespread integration of these models into enterprise services, primarily through commercial cloud-based solutions.
  • Enterprises are increasingly motivated to self-host "LLM as a Service" due to privacy concerns, cost considerations, and customization needs.
  • Independent hosting of LLMs presents significant challenges related to complexity and integration with existing systems.
  • The proposed middleware aims to address these challenges, facilitating LLM deployment even for advanced use cases where LLMs may serve as gateways to a complete application ecosystem.
  • This architecture envisions LLMs potentially absorbing functionality traditionally attributed to middleware, representing a shift in system design and functionality.

Source: Towards a Middleware for Large Language Models

HULA: Human-in-the-Loop LLM Agents for Software Development

HULA is a framework that integrates human feedback with LLM-based agents to automate software development tasks, deployed in Atlassian JIRA for internal use.

  • The framework allows software engineers to refine and guide LLMs when generating coding plans and source code for given tasks.
  • Atlassian engineers reported that HULA can reduce overall development time and effort, particularly for initiating coding plans and writing code for straightforward tasks.
  • Some challenges related to code quality were identified, highlighting areas for improvement.
  • The study provides insights and opportunities for advancing LLM-based agents in software development, addressing limitations of existing approaches that rely on historical benchmark datasets without human input.

Source: Human-In-the-Loop Software Development Agents

Layered Architecture for LLM-based Software Systems

A framework that organizes LLM software system development into distinct layers, each with specific attributes, to enhance capabilities beyond basic language tasks.

  • The architecture addresses the gap between LLMs' native capabilities and evolving application demands.
  • It provides a systematic approach for implementing capabilities in effective and efficient ways, supporting desired functionalities and qualities.
  • The framework helps developers select suitable technologies, considering trade-offs in engineering complexity, scalability, and operational costs.
  • Practical case studies illustrate the utility of the layered architecture in real-world scenarios.
  • This approach aims to promote robustness and scalability in LLM-based software system development.

Source: A Layered Architecture for Developing and Enhancing Capabilities in Large Language Model-based Software Systems

MPDetector: Identifying Multi-Parameter Constraint Inconsistencies in Python Data Science Libraries

MPDetector is a tool designed to detect inconsistencies between code and documentation for multi-parameter constraints in data science and machine learning library APIs.

  • The tool addresses the challenge of maintaining correct and consistent multi-parameter constraints in API documentation, crucial for API compatibility and reliability.
  • MPDetector uses symbolic execution to identify constraints at the code level and employs LLMs to extract corresponding constraints from documentation.
  • A customized fuzzy constraint logic is implemented to reconcile LLM output unpredictability and detect logical inconsistencies between code and documentation constraints.
  • Evaluation on datasets from four popular data science libraries showed MPDetector's effectiveness, with a precision of 92.8% in detecting inconsistency issues.
  • Out of 14 reported inconsistency issues, 11 were confirmed by library developers at the time of writing.

Source: Detecting Multi-Parameter Constraint Inconsistencies in Python Data Science Libraries

Impact of AI-Generated Code Reviews: An Experimental Study

A controlled experiment with 29 experts explores the effects of incorporating automatically generated code reviews into the review process, focusing on review quality, cost, and reviewer confidence.

  • The study monitored over 50 hours of code reviews, comparing manual reviews to those supported by Large Language Model (LLM) generated feedback.
  • Reviewers generally considered the LLM-identified issues valid, but tended to focus on areas highlighted by the AI rather than exploring other parts of the code.
  • LLM-supported reviews resulted in more low-severity issues being identified, but did not increase the detection of high-severity issues compared to manual reviews.
  • Contrary to potential expectations, the use of AI-generated reviews did not reduce review time or increase reviewer confidence.
  • The findings suggest that while AI can influence the review process, it may not necessarily lead to improvements in all aspects of code review efficiency and effectiveness.

Source: Deep Learning-based Code Reviews: A Paradigm Shift or a Double-Edged Sword?