# 📊 A2A Protocol Demo 12: Observability & Evaluation
> **Multi-Agent System Full-Link Monitoring and Evaluation System**
>
> A Comprehensive Observability Suite for A2A Protocols: Metrics, Tracing, and Performance Analysis.
---
## 📖 1. Core Concepts and Background
When building enterprise-level Agent systems, we face a challenge more severe than those of traditional software: the **"Black Box Problem."**
* **Non-Determinism**: The same prompt may yield different outputs from the LLM each time.
* **Information Loss**: Does context compression (Demo 4/10) lose critical information?
* **Complex Call Chains**: A single request may traverse Researcher -> Writer -> Reviewer; where is the bottleneck, and where did it fail?
This project (Demo 12) provides a complete solution, including **Quantitative Evaluation Metrics** and **Distributed Tracing**.
---
## 📐 2. Key Evaluation Metrics
This system introduces two core metrics to quantify the quality of **Context Compression**.
### 2.1 IRR (Information Retention Rate)
Measures whether the compressed summary retains the key semantics from the original context (Slots).
* **Principle**: Compute the **average cosine similarity** between each original Slot's embedding vector and the compressed summary's embedding vector.
* **Formula**:
$$ IRR = \frac{1}{N} \sum_{i=1}^{N} \text{CosSim}(\vec{Slot_i}, \vec{Summary}) $$
* **Interpretation**:
* `> 0.85`: Excellent, key information is fully retained.
* `< 0.60`: Poor, significant information loss has occurred (e.g., missing time, location).
### 2.2 SOR (Semantic Offset Rate)
Measures whether the compressed content has undergone "mutation" or "hallucination."
* **Formula**: $SOR = 1.0 - IRR$
* **Interpretation**: The lower the value, the better. A high SOR indicates that the summary's meaning has deviated from the original text.
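The two metrics above can be sketched in a few lines. This is a minimal illustration, not the project's `metrics.py`: the 4-dimensional one-hot vectors stand in for real embedding-model output, and the slot labels in the comments are hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def irr(slot_vectors: np.ndarray, summary_vector: np.ndarray) -> float:
    """Information Retention Rate: mean cosine similarity between
    each slot vector and the summary vector."""
    sims = cosine_similarity(slot_vectors, summary_vector.reshape(1, -1))
    return float(sims.mean())

def sor(slot_vectors: np.ndarray, summary_vector: np.ndarray) -> float:
    """Semantic Offset Rate: the complement of IRR."""
    return 1.0 - irr(slot_vectors, summary_vector)

# Toy embeddings standing in for a real embedding model.
slots = np.array([
    [1.0, 0.0, 0.0, 0.0],   # e.g. "tomorrow"
    [0.0, 1.0, 0.0, 0.0],   # e.g. "Beijing"
    [0.0, 0.0, 1.0, 0.0],   # e.g. "window seat"
])
good_summary = np.array([1.0, 1.0, 1.0, 0.0])  # overlaps all three slots
bad_summary  = np.array([0.0, 1.0, 0.0, 0.0])  # only overlaps "Beijing"

print(f"IRR (good): {irr(slots, good_summary):.4f}")
print(f"IRR (bad):  {irr(slots, bad_summary):.4f}")
```

With real embeddings the absolute scores differ, but the ordering is the point: a summary that drops slots scores a lower IRR and a correspondingly higher SOR.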
---
## 🕵️ 3. Distributed Tracing
To monitor multi-Agent collaboration, we simulate a tracing mechanism similar to **OpenTelemetry**.
* **Trace ID**: A unique identifier that spans the entire task lifecycle (e.g., `trace_8a2b...`).
* **Span**: Records a single atomic operation (e.g., a search, an LLM inference).
* **Latency**: Records `End Time - Start Time`, used to identify performance bottlenecks.
* **Status**: `SUCCESS` / `ERROR`, used to monitor stability.
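The four concepts above fit in a small data structure. Here is a minimal sketch of such a tracer; the class and method names (`Span`, `Tracer`, `start_span`) are illustrative, not the project's `tracer.py` API.

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    """One atomic operation inside a trace."""
    trace_id: str
    operation: str
    start: float = field(default_factory=time.perf_counter)
    status: str = "SUCCESS"
    latency_ms: float = 0.0

    def finish(self, status: str = "SUCCESS") -> None:
        # Latency = End Time - Start Time, in milliseconds.
        self.latency_ms = (time.perf_counter() - self.start) * 1000
        self.status = status

class Tracer:
    """Holds one Trace ID and the Spans recorded under it."""
    def __init__(self) -> None:
        self.trace_id = f"trace_{uuid.uuid4().hex[:8]}"
        self.spans: list[Span] = []

    def start_span(self, operation: str) -> Span:
        span = Span(self.trace_id, operation)
        self.spans.append(span)
        return span

    def error_rate(self) -> float:
        errors = sum(1 for s in self.spans if s.status == "ERROR")
        return errors / len(self.spans) if self.spans else 0.0

tracer = Tracer()
span = tracer.start_span("Planner->Decompose")
time.sleep(0.01)  # stand-in for real Agent work
span.finish("SUCCESS")
print(f"{tracer.trace_id} | {span.operation} | {span.status} | {span.latency_ms:.2f} ms")
```

A production system would export these spans to an OpenTelemetry collector instead of printing them, but the data model is the same.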
---
## 📂 4. Project Structure
```text
a2a-protocol-demo12/
├── metrics.py # [Core Algorithm] Implements vector similarity calculation, IRR/SOR calculation
├── tracer.py # [Monitoring Component] Implements Span recording, Trace context management
├── simulator.py # [Scenario Simulation] Constructs fake Agent interaction data for testing
├── main.py # [Entry Point] Runs the evaluation demo
└── requirements.txt # Dependency libraries (numpy, scikit-learn)
```
---
## 🚀 5. Running Guide
### Step 1: Install Dependencies
The core dependencies are `numpy` (for matrix operations) and `scikit-learn` (for cosine similarity).
```bash
pip install -r requirements.txt
```
### Step 2: Run the Evaluation Program
```bash
python main.py
```
---
## 🧪 6. Experimental Result Analysis (Learning Notes)
After running the program, you will see output in two parts. Here is a detailed interpretation of each:
### Scenario A: Semantic Compression Quality Assessment
The program simulates compressing the phrase *"Book a window seat flight to Beijing tomorrow."*
**Output Example:**
```text
🧪 [Test 1] Slot Compression Semantic Integrity Assessment
✅ Option A (High-Quality Summary): 'Book a flight to Beijing tomorrow morning, requesting a window seat.'
📊 IRR (Retention Rate): 0.9241 <-- High score, retains "tomorrow," "Beijing," "window seat"
📊 SOR (Offset Rate): 0.0759
❌ Option B (Low-Quality Summary): 'Book a flight to Beijing.'
📊 IRR (Retention Rate): 0.6510 <-- Low score, loses the semantic vectors of "tomorrow" and "window seat"
📊 SOR (Offset Rate): 0.3490
```
> **📝 Note**: This proves that we can automate the testing of Prompt quality using mathematical methods, without needing to manually review each one.
### Scenario B: Multi-Agent Collaboration Link Tracing
The program simulates the call chain of `Planner -> Search -> Writer`.
**Output Example:**
```text
🧪 [Test 2] Multi-Agent Collaboration Link Tracing
Trace ID | Operation | Status | Latency (ms)
------------------------------------------------------------
trace_001..| Planner->Decompose | SUCCESS | 102.45
trace_001..| Search->Google | SUCCESS | 450.12 <-- Found to take longer, possibly due to network I/O
trace_001..| Writer->Generate | ERROR | 101.05 <-- Error found, Token overflow
------------------------------------------------------------
📉 Error Rate: 1/3
```
> **📝 Note**:
> 1. Through **Latency**, we can decide whether to add caching for a certain Agent.
> 2. Through **Status**, we can trigger an automatic retry mechanism (Retry Policy).
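A status-driven retry policy like the one mentioned in note 2 can be sketched as follows. This is an assumption about how such a policy might look, not code from this demo; `flaky_writer` is a hypothetical stand-in for an Agent call.

```python
import time

def with_retry(operation, max_attempts: int = 3, backoff_s: float = 0.0):
    """Re-invoke `operation` until it reports SUCCESS or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status, result = operation()
        if status == "SUCCESS":
            return status, result
        if attempt < max_attempts:
            time.sleep(backoff_s * attempt)  # linear backoff between attempts
    return status, result

# Simulated flaky Agent call: errors twice (e.g. token overflow), then succeeds.
calls = {"n": 0}
def flaky_writer():
    calls["n"] += 1
    if calls["n"] < 3:
        return "ERROR", None
    return "SUCCESS", "draft"

status, result = with_retry(flaky_writer)
print(status, result, calls["n"])
```

Note that blind retries only help with transient failures; a deterministic error such as token overflow needs a different remedy (e.g. re-compressing the context first).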
---
## 🧠 7. Architectural Considerations
In the enterprise-level A2A architecture, this module is located in the **Support Layer**:
```mermaid
graph TD
subgraph "Agent Runtime"
Core[Business Agent]
end
subgraph "Support Layer (Demo 12)"
Metrics[Evaluation Service]
Tracer[Tracing Service]
end
Core --> |1. Send Slot| Metrics
Metrics --> |2. Return IRR Score| Core
Core --> |3. Report Span| Tracer
Tracer --> |4. Generate Dashboard| Dashboard
```
## 📝 License
MIT