# green-white-agent-personagym
Green and White Agent implementation for PersonaGym, built on top of the A2A framework.
## Prerequisites
- Python 3.10 or higher.
- Access to a terminal or command prompt.
- Git, for cloning the repository.
- A code editor (e.g., Visual Studio Code) is recommended.
## Clone the Repository
If you haven't already, clone the repository:
```bash
git clone https://github.com/crocoro-familiy/green-white-agent-personagym.git
cd green-white-agent-personagym
```
## Python Environment & SDK Installation
We recommend using a virtual environment for Python projects. The A2A Python SDK uses `uv` for dependency management, but you can use `pip` with `venv` as well.
1. **Create and activate a virtual environment:**
Using `venv` (standard library):
=== "Mac/Linux"
```sh
python -m venv .venv
source .venv/bin/activate
```
=== "Windows"
```powershell
python -m venv .venv
.venv\Scripts\activate
```
2. **Install the required Python dependencies, including the A2A SDK:**
```bash
pip install -r requirements.txt
```
## Verify Installation
After installation, you should be able to import the `a2a` package in a Python interpreter:
```bash
python -c "import a2a; print('A2A SDK imported successfully')"
```
If this command runs without error and prints the success message, your environment is set up correctly.
---
## Project Structure
```
project-root/
├── code/ # Core agent logic, scripts, and HTML tools
│ ├── set_green_agent.py
│ ├── set_white_agent.py
│ ├── persona-eval-redesign-crocoro.html
│ └── ...
├── prompts/ # Prompt templates used for persona evaluation
├── specialists/ # Specialist agent definitions or role profiles
├── specialist_questions/ # Specialist-specific questions
├── questions/ # Static evaluation questions
├── rubrics/ # Rubrics for evaluation
├── pics/ # Images for documentation or visualization
├── README.md
└── requirements.txt # Python dependencies
```
## New Feature - Specialists for Domain-Specific Questions
The PersonaGym benchmark can now use specialists with domain-specific questions to test the persona agent. To add a custom specialist, prepare a specialist `.json` file, a set of domain-specific questions, and a new rubric:
<img src="pics/specialist.png" alt="Specialist" width="80%">
#### Step 1: Add a specialist
Create a specialist configuration as a `.json` file and save it in the `specialists/` folder.
#### Step 2: Add domain-specific questions
Create a list of domain-specific questions and save it in the `specialist_questions/` folder.
#### Step 3: Add a domain-specific rubric
Create a new domain-specific rubric and save it in the `rubrics/` folder.
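As a hedged illustration of the three steps above, the snippet below writes a hypothetical specialist file. All field names here are assumptions, not a schema required by the benchmark; inspect an existing file in `specialists/` for the real structure.

```python
import json
import os

# Hypothetical specialist definition -- the field names below are
# illustrative only. Check an existing file in specialists/ for the
# exact expected structure.
specialist = {
    "name": "culinary_expert",
    "domain": "cooking",
    "questions_file": "specialist_questions/cooking_questions.json",
    "rubric_file": "rubrics/cooking_rubric.json",
}

os.makedirs("specialists", exist_ok=True)
with open("specialists/culinary_expert.json", "w") as f:
    json.dump(specialist, f, indent=2)
```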
## Command Prompt Usage Guide
We provide two ways to set up agents: the command prompt and the GUI. This section walks you through the command prompt.
To set up the Green Agent server (Terminal 1):
```bash
python set_green_agent.py
```
We have provided a mock White Agent for you to test the code. Next, set up the White Agent (Terminal 2):
```bash
python set_white_agent.py
```
Finally, run the kickoff script; the Green Agent will then "talk" to the White Agent and the evaluation starts:
```bash
python kick_off.py
```
Please note that evaluating one persona can take around 15 to 20 minutes with 10 questions per task. To speed up your testing, we have reduced this to 1 question per task; you may change it back to 10 to fully follow the original benchmark setting. Also, remember to set your OpenAI API key:
```bash
export OPENAI_API_KEY="your_api_key_here"
```
Instead of setting an environment variable, you may also copy the key directly into `api.py`.
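If you go the `api.py` route, a safer pattern is to still prefer the environment variable and only fall back to a hard-coded value. A minimal sketch (the actual layout of `api.py` may differ):

```python
import os

# Prefer the OPENAI_API_KEY environment variable; the fallback string
# below is only a placeholder for local testing -- avoid committing a
# real key to version control.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "your_api_key_here")

if OPENAI_API_KEY == "your_api_key_here":
    print("Warning: OPENAI_API_KEY is not set; using placeholder value")
```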
## GUI Usage Guide
The Persona Evaluation Toolkit provides a web interface for evaluating White Agents using the Green Agent evaluator. This section will guide you through using the GUI.
### Quick Start: Steps to Use the GUI
#### Step 1: Start the Required Services
Before using the GUI, you need to start three services in separate terminal windows:
**Terminal 1 - Start the Green Agent (Evaluator):**
```bash
cd code
python set_green_agent.py
```
The Green Agent will start on port **9999** by default.
**Terminal 2 - Start White Agent(s):**
You can run one or multiple White Agents. Each White Agent should run on a different port:
```bash
# White Agent 1 (port 8001)
cd code
python set_white_agent.py

# To run additional White Agents, copy set_white_agent.py, change its
# port number, copy and revise white_agent_card.toml accordingly, and
# run the copy in a new terminal. For example:

# White Agent 2 (port 8002)
cd code
python set_white_agent_2.py
```
**Terminal 3 - Start the Web Server (GUI):**
```bash
cd code
python web_server.py
```
The web server will start on port **8080** by default.
#### Step 2: Open the GUI in Your Browser
Navigate to:
```
http://localhost:8080
```
You should see the Persona Evaluation Toolkit interface with its green-themed design.

*Figure 1: Main GUI Interface - The Persona Evaluation Toolkit dashboard*
#### Step 3: Connect to a White Agent
1. **Enter White Agent URL**: In the first test card (White Agent 1), you'll see a default URL `http://localhost:8001`. You can modify this or add URLs for additional agents.
2. **Click "🔗 Connect to White Agent"**: This button will:
- Connect to the White Agent at the specified URL
- Fetch the persona description from the agent
- Display the persona information in the card
3. **Verify Connection**: Once connected, you'll see:
- A success message showing "Connected!"
- The persona description displayed below the connect button
- The status changes to show the connection is ready
#### Step 4: Run Evaluation
1. **Click "▶️ Run Test"**: This will start the evaluation process. The Green Agent will:
- Analyze the White Agent's persona
- Generate evaluation questions
- Interact with the White Agent
- Score the responses
- Display results in real-time

*Figure 2: Connecting to a White Agent - The system fetches and displays the persona description*
2. **Watch Real-Time Progress**: A modal window will open showing:
- Connection status
- Question generation progress
- Question-answer pairs as they're collected
- Scoring progress
- Final results

*Figure 3: Evaluation in Progress - Real-time display of questions, answers, and scoring*
#### Step 5: View Results
After evaluation completes:
1. **View Detailed Scores**: Click the "📊 View Detailed Score Report" button that appears in the evaluation modal
2. **Score Breakdown**: The score modal shows:
- Overall Persona Score (large circular display)
- Per-task scores with detailed reasoning
- Score reasons explaining why each score was given

*Figure 4: Evaluation Results - Detailed scores with reasoning for each evaluation dimension*
3. **Last Score Display**: The test card will show a "Last Score" section with:
- Overall Persona Score badge
- Individual task score badges
- Quick reference for comparison

*Figure 5: Score Summary - Last evaluation scores displayed on the test card*
4. **Result Files**: Each evaluation result is automatically saved as a JSON file in `code/agent_results/` directory with a unique session ID. You can review these files later for detailed analysis or comparison.
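Since the saved results are plain JSON, you can load them for offline analysis or comparison. A sketch follows; the `"overall_score"` key is an assumption, so inspect a real file in `code/agent_results/` for the exact field names.

```python
import glob
import json

def summarize_results(results_dir="code/agent_results"):
    """Return (path, overall score) pairs for every saved result file."""
    summaries = []
    for path in sorted(glob.glob(f"{results_dir}/*.json")):
        with open(path) as f:
            result = json.load(f)
        # "overall_score" is an assumed key -- check a real result file
        # for the actual field names before relying on this.
        summaries.append((path, result.get("overall_score")))
    return summaries

if __name__ == "__main__":
    for path, score in summarize_results():
        print(path, score)
```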
---
### 🚀 GUI Features
**Multiple White Agents**: Support for up to 3 agents simultaneously. Connect each test card to a different White Agent URL (e.g., `http://localhost:8001`, `http://localhost:8002`). Use "▶️ EVALUATE ALL AGENTS" to run evaluations sequentially.
**Score Display**: View detailed scores with reasoning for each evaluation dimension (Expected Action, Toxicity, Linguistic Habits, Persona Consistency, Action Justification). Click "📊 View Detailed Score Report" after evaluation completes.
**Reopen Evaluations**: The "🔄 Reopen Last Evaluation" button appears after each evaluation, allowing you to review the complete conversation history.
**Real-Time Updates**: Watch evaluation progress in real-time, including question-answer pairs, specialist domain detection, and scoring progress.
**Bulk Operations**: Export results with "📤 SHARE WITH DEVELOPER" or reset everything with "🗑️ CLEAR ALL RECORDS".
**Result Persistence**: All evaluation results are automatically saved as JSON files in `code/agent_results/` with unique session IDs, allowing you to review and compare results later.
---