# AI API Eval Framework For Multimodal Generative AI
## Personal Information

- **Full Name:** Nideesh Bharath Kumar
- **University Name:** Rutgers University–New Brunswick
- **Program Enrolled In:** B.S. Computer Science, Artificial Intelligence Track
- **Year:** Junior Year (Third Year)
- **Expected Graduation Date:** May 2026
## About Me
I’m **Nideesh Bharath Kumar**, a junior (third year) at Rutgers University–New Brunswick pursuing a **B.S. in Computer Science on the Artificial Intelligence Track**. I have a strong foundation in full-stack development and AI engineering, with project and internship experience in **Dart/Flutter, LangChain, RAG, vector databases, AWS, Docker, Kubernetes, PostgreSQL, FastAPI, OAuth,** and other technologies for building scalable, AI-powered systems. I have interned at **Manomay Tech, IDEA, and Newark Science and Sustainability**, where I developed scalable systems and managed AI systems, and I have completed fellowships with **Google** and **CodePath** that strengthened my technical skills. I have also won hackathon awards, including **Overall Best Project at the CS Base Climate Hackathon for a Flutter-based project** and **Best Use of Terraform at the HackRU Hackathon for a Computer Vision Smart Shopping Cart**. I’m passionate about building distributed, scalable systems and AI technologies, and API Dash is an amazing tool that facilitates building these solutions through easy visualization and testing of APIs. I believe my skills in **AI development** and experience with **Dart/Flutter** and **APIs** put me in a position to contribute effectively to this project.
## Project Details
**Project Title:** AI API Eval Framework For Multimodal Generative AI

**Description:**

This project will develop a **Dart-centered evaluation framework** designed to simplify the testing of generative AI models across **multiple modalities (text, image, code)**. It will integrate existing evaluation toolkits: **lm-evaluation-harness** for text, **torch-fidelity** and **CLIP** for images, and **HumanEval/MBPP** with **CodeBLEU** for code. The framework will provide a unified config layer that supports both standard and custom benchmark datasets and evaluation metrics, exposed through a **user-friendly interface in API Dash** where the user selects the model type, manages datasets (local or downloadable), and chooses evaluation metrics (standard toolkit or custom script). On top of this, **real-time visual analytics** will track metric progress while evaluations run, and **parallelized batch processing** will speed up large evaluation jobs.

**Related Issue:** [#618](https://github.com/foss42/apidash/issues/618)

**Key Features:**

1) Unified Evaluation Configuration
- A YAML config file will serve as the abstraction layer, generated from the user's selection of model type, dataset, and evaluation metrics. Based on these selections, the job is routed to lm-evaluation-harness, to torch-fidelity and CLIP, or to HumanEval and MBPP with CodeBLEU. Additionally, custom evaluation scripts and datasets can be attached to the config file and interpreted by the system.

- This abstraction layer ensures that however the specifications of an eval job differ, everything is routed to the correct resources while still providing a centralized place to define the job. Furthermore, config files can be stored in history so the same jobs can be re-run later (a sketch of what a generated config might look like follows this list).
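As a rough illustration, a generated config and the Dart code that loads it might look like the following; the field names and routing keys are assumptions for illustration, not a finalized schema, and only the `yaml` package from pub.dev is assumed.

```dart
import 'package:yaml/yaml.dart';

// Illustrative eval config; every field and value here is an assumption,
// not a finalized schema.
const exampleConfig = '''
model_type: text                 # text | image | code
dataset:
  source: local                  # local | download
  path: ./datasets/qa_eval.jsonl
metrics: [f1, bleu, rouge]
custom_eval_script: null         # optional path to a user-supplied script
batch_size: 16
''';

void main() {
  // Parse the YAML so the evaluation manager can route the job.
  final config = loadYaml(exampleConfig) as YamlMap;
  final modelType = config['model_type'] as String;

  // Route to the matching toolkit, mirroring the abstraction layer above.
  switch (modelType) {
    case 'text':
      print('Route to lm-evaluation-harness');
      break;
    case 'image':
      print('Route to torch-fidelity / CLIP');
      break;
    case 'code':
      print('Route to HumanEval / MBPP + CodeBLEU');
      break;
  }
}
```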
2) Intuitive User Interface
- When starting an evaluation, users select the model type (text, image, or code) from a drop-down menu. The system then lists standard datasets and use cases, and the user can pick one of them or attach a custom dataset. If the dataset is not already in the local workspace, it can be attached through the file explorer or downloaded from the web. Finally, the user can select standard evaluation metrics from a list or attach a custom script (a minimal sketch of the model-type selector follows).
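A minimal Flutter sketch of the model-type drop-down; the widget and callback names are illustrative and the real selector would follow API Dash's existing UI conventions.

```dart
import 'package:flutter/material.dart';

/// Illustrative model-type selector; not API Dash's actual widget tree.
class ModelTypeSelector extends StatefulWidget {
  const ModelTypeSelector({super.key, required this.onSelected});

  final ValueChanged<String> onSelected;

  @override
  State<ModelTypeSelector> createState() => _ModelTypeSelectorState();
}

class _ModelTypeSelectorState extends State<ModelTypeSelector> {
  static const _modelTypes = ['text', 'image', 'code'];
  String _selected = 'text';

  @override
  Widget build(BuildContext context) {
    return DropdownButton<String>(
      value: _selected,
      items: [
        for (final type in _modelTypes)
          DropdownMenuItem(value: type, child: Text(type)),
      ],
      onChanged: (value) {
        if (value == null) return;
        setState(() => _selected = value);
        widget.onSelected(value); // hand the choice to the config manager
      },
    );
  }
}
```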
3) Standard Evaluation Pipelines
- The standard evaluation pipelines cover text, image, and code generation.

- For text generation, lm-evaluation-harness will be used with custom datasets and tasks to measure Precision, Recall, F1 Score, BLEU, ROUGE, and Perplexity. Custom datasets and metrics can be integrated by interfacing with lm-evaluation-harness's custom task config files.

- For image generation, torch-fidelity can calculate Fréchet Inception Distance and Inception Score against a reference image database. For text-to-image generation, CLIP scores can be used to measure how well the generated image matches the prompt. Custom datasets and metrics can be integrated through a custom interface written in Dart.

- For code generation, benchmarks like HumanEval and MBPP can check functional correctness, and CodeBLEU can check code quality. Custom integration works the same way as for image generation, with a Dart interface for functional test databases and evaluation metrics (a sketch of how these toolkits might be invoked follows this list).
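As a rough sketch of how a pipeline could be invoked from Dart, the example below shells out to lm-evaluation-harness's `lm_eval` CLI for the text case; the flag names reflect common usage but should be treated as assumptions to verify against the installed version, and the image and code toolkits would be wrapped similarly.

```dart
import 'dart:io';

/// Runs the text-generation pipeline by shelling out to lm-evaluation-harness.
/// Flag names reflect common `lm_eval` usage and should be verified against
/// the installed version; this is a sketch, not API Dash's actual bridge.
Future<String> runTextEval({
  required String modelArgs, // e.g. 'pretrained=gpt2' (illustrative)
  required String tasks,     // e.g. 'hellaswag' or a custom task name
  required String outputDir, // lm_eval writes JSON results under this path
}) async {
  final result = await Process.run('lm_eval', [
    '--model', 'hf',
    '--model_args', modelArgs,
    '--tasks', tasks,
    '--batch_size', '8',
    '--output_path', outputDir,
  ]);

  if (result.exitCode != 0) {
    throw ProcessException('lm_eval', ['--tasks', tasks],
        'evaluation failed: ${result.stderr}', result.exitCode);
  }

  // A real implementation would locate and parse the JSON results file
  // written under `outputDir`; here we just return the console summary.
  return result.stdout as String;
}
```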
4) Batch Evaluations
- Parallel processing will be supported through async runs of the tests, with a progress bar in API Dash tracking the number of processed rows (see the sketch below).
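A minimal sketch of chunked async execution with progress reporting; the chunk size and the `evaluateRow` callback are illustrative assumptions.

```dart
import 'dart:async';

/// Runs [evaluateRow] over [rows] in parallel chunks and reports progress
/// as a count of completed rows (which the UI can map to a progress bar).
Stream<int> runBatched<T>(
  List<T> rows,
  Future<void> Function(T row) evaluateRow, {
  int concurrency = 8,
}) async* {
  var completed = 0;
  for (var i = 0; i < rows.length; i += concurrency) {
    final end = i + concurrency > rows.length ? rows.length : i + concurrency;
    final chunk = rows.sublist(i, end);
    // Evaluate one chunk of rows concurrently.
    await Future.wait(chunk.map(evaluateRow));
    completed += chunk.length;
    yield completed; // progress update for the UI
  }
}
```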
5) Visualizations of Results
- Results will be visualized while the tests are running, giving live feedback on model performance, with a summary set of visualizations once all evals have finished.

- Bar Graphs: displayed on a 0 to 100% accuracy scale to give a quick performance comparison across all tested models.

- Line Charts: show model performance trends over time, comparing performance across batches as well as between models.

- Tables: provide detailed summary statistics of each model's scores across different benchmarks and datasets.

- Box Plots: show the distribution of scores per batch, highlighting outliers and variance, with side-by-side comparisons between models (a minimal chart sketch follows this list).
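A minimal sketch of the accuracy bar graph using fl_chart (already listed in the architecture section); the data and widget parameters are placeholders, and the exact fl_chart API may vary between versions.

```dart
import 'package:fl_chart/fl_chart.dart';
import 'package:flutter/material.dart';

/// Illustrative 0-100% accuracy comparison across models (placeholder data).
class AccuracyBarChart extends StatelessWidget {
  const AccuracyBarChart({super.key, required this.scores});

  /// Model name -> accuracy in percent, e.g. {'model-a': 87.5}.
  final Map<String, double> scores;

  @override
  Widget build(BuildContext context) {
    final entries = scores.entries.toList();
    return BarChart(
      BarChartData(
        minY: 0,
        maxY: 100,
        barGroups: [
          for (var i = 0; i < entries.length; i++)
            BarChartGroupData(
              x: i,
              barRods: [BarChartRodData(toY: entries[i].value)],
            ),
        ],
      ),
    );
  }
}
```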
6) Offline and Online Support
- Offline: local models will be supported by pointing to the script the model uses to run and to locally stored datasets.

- Online: hosted models can be connected for eval through an API endpoint, and datasets can be downloaded from a provided link (see the sketch below).
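A minimal sketch of calling a hosted model endpoint with the `http` package; the endpoint, request body, and response shape are hypothetical and depend on the provider being evaluated.

```dart
import 'dart:convert';
import 'package:http/http.dart' as http;

/// Sends one prompt to a hosted model endpoint and returns the raw response.
/// The endpoint URL and JSON fields below are placeholders, not a real API.
Future<Map<String, dynamic>> queryOnlineModel(
  Uri endpoint,
  String prompt, {
  String? apiKey,
}) async {
  final response = await http.post(
    endpoint,
    headers: {
      'Content-Type': 'application/json',
      if (apiKey != null) 'Authorization': 'Bearer $apiKey',
    },
    body: jsonEncode({'prompt': prompt}),
  );

  if (response.statusCode != 200) {
    throw http.ClientException('Eval request failed: ${response.statusCode}');
  }
  return jsonDecode(response.body) as Map<String, dynamic>;
}
```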
**Architecture:**

1) UI Interface: Built with Dart/Flutter

2) Configuration Manager: Built with Dart, uses YAML for the config file

3) Dataset Manager: Built with Dart, REST APIs for accessing endpoints

4) Evaluation Manager: Built with a Dart–Python layer to manage connections between evaluators and API Dash (see the sketch below)

5) Batch Processing: Built with Dart async requests

6) Visualization and Results: Built with Dart/Flutter, using packages like fl_chart and syncfusion_flutter_charts
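To make the Evaluation Manager concrete, here is a minimal sketch of the Dart–Python layer: API Dash starts a Python worker process and reads progress updates as JSON lines from its stdout. The worker script name and message format are assumptions for illustration.

```dart
import 'dart:convert';
import 'dart:io';

/// Starts a hypothetical Python evaluation worker and streams its progress.
/// Assumes the worker prints one JSON object per line, e.g.
/// {"processed": 42, "total": 500}, followed by a final {"results": {...}}.
Future<void> runEvalJob(String configPath) async {
  final process = await Process.start(
    'python3',
    ['eval_worker.py', '--config', configPath], // illustrative worker script
  );

  // Forward progress lines to whatever updates the API Dash progress bar.
  process.stdout
      .transform(utf8.decoder)
      .transform(const LineSplitter())
      .listen((line) {
    final message = jsonDecode(line) as Map<String, dynamic>;
    if (message.containsKey('processed')) {
      print('Progress: ${message['processed']}/${message['total']}');
    } else if (message.containsKey('results')) {
      print('Final results: ${message['results']}');
    }
  });

  process.stderr.transform(utf8.decoder).listen(stderr.write);

  final code = await process.exitCode;
  if (code != 0) {
    throw ProcessException('python3', ['eval_worker.py'],
        'evaluation worker exited with $code', code);
  }
}
```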