# NewsMind 🤖
## Web Interface Demo

## Feature Overview 🌟
- 🔍 Generate news-related information based on LLM.
- 📰 The RAG system automatically stores and recalls retrieved news content.
- 🌐 Supports bilingual news search and processing in Chinese and English.
- 🧠 Article filtering and content relevance assessment based on LLM.
- 📊 Automatically integrates multi-domain news content to generate structured reports.
- 📧 Automatic email distribution system, supporting multiple recipients.
- ⏰ Supports scheduled execution.
- 🤖 Central agent automatically schedules all tasks.
## Implementation Principle 🏗️
This system adopts a central agent for unified scheduling over a collaborative multi-tool-agent architecture. Combined with API-based news retrieval, LangChain agent orchestration, and an Agentic RAG system built on LlamaIndex, it achieves efficient, intelligent news aggregation and distribution.
### System Architecture and Process
The system comprises one core intelligent agent and four tool agents:
1. **NewsMind Agent - Core**
- As the central task scheduling agent, it uses DeepSeek-R1-Distill-Qwen-14B as the central processor. Upon receiving user instructions, it can automatically schedule all tasks. Its main functions include:
- Determine whether to call tool agents; if not needed, it engages in dialogue with the user using the model's own knowledge.
- Decide the type of agent to call (search / rag / none); for news-related questions, it determines whether to search for the latest news or recall historical news.
- Extract keywords from user questions to direct the search/retrieval of news in areas of interest to the user.
- Implementation method: function calling
2. **RAG Agent**
Built on the LlamaIndex framework, combined with the Qdrant vector database to achieve efficient Retrieval-Augmented Generation. The main functions of this agent include:
- Receive scheduling requests from the NewsMind agent and perform vector retrieval based on user-provided questions and keywords.
- Query the pre-built news corpus vector index in Qdrant based on keywords to recall relevant historical news content.
- Support vectorized storage and retrieval of multi-domain news, ensuring long-term memory capability for historical topics.
3. **News Retrieval Agent**
- Based on the LangChain framework, it calls the qwen-plus large model to dynamically generate keyword pairs in both Chinese and English across all domains, as well as subsequent news filtering prompts, saving them to `domainsConf.json`.
- The large model, combined with domain-specific prompts, filters or expands highly relevant keywords from the `domainsConf.json` file to generate keyword combinations for news retrieval.
- Using the above keyword combinations, it automatically calls the news API to retrieve the latest news, supporting multilingual and multi-domain parallel retrieval, enhancing information coverage.
4. **Content Integration and Filtering Agent**
- Deduplicates, aggregates, and structures the retrieved news.
- Evaluates the relevance and representativeness of news content based on the qwen-plus large model, filtering high-value information to ensure diverse and representative results.
5. **Email Distribution Agent**
- Automatically generates structured news briefs and email subjects.
- Reads the `EMAIL_RECEIVER` field from the `.env` file as the recipient email addresses (separate multiple addresses with commas).
- Sends the integrated results to the configured recipient's email, supporting multiple recipients and scheduled execution.
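As a rough sketch of this distribution step (standard library only; the function and variable names here are illustrative assumptions, not the project's actual code), the recipient list can be parsed from `EMAIL_RECEIVER` and assembled into a MIME message:

```python
# Illustrative sketch of the email distribution step using only the standard
# library; names here are assumptions, not project code.
import os
import smtplib  # imported to show the sending path; unused while offline
from email.mime.text import MIMEText

def build_message(sender, receivers, subject, body):
    """Assemble an HTML email addressed to every configured recipient."""
    msg = MIMEText(body, "html")
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = ", ".join(receivers)
    return msg

# Comma-separated recipients, as in the configuration section.
receivers = [r.strip() for r in os.environ.get(
    "EMAIL_RECEIVER", "recipient1@example.com,recipient2@example.com").split(",") if r.strip()]
msg = build_message("newsmind@example.com", receivers, "Daily News Brief", "<h1>Brief</h1>")
print(msg["To"])

# The actual send would look roughly like:
# with smtplib.SMTP(os.environ["SMTP_SERVER"], int(os.environ["SMTP_PORT"])) as s:
#     s.starttls()
#     s.login(os.environ["EMAIL_SENDER"], os.environ["EMAIL_PASSWORD"])
#     s.send_message(msg)
```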
---
### Fine-tuning Large Models
#### 1. DeepSeek-R1-Distill-Qwen-14B
- 🚫 DeepSeek-R1-Distill-Qwen-14B does not natively support OpenAI's standard function call format. OpenAI's function call is a special message structure and calling convention, including:
```json
{
  "role": "assistant",
  "function_call": {
    "name": "xxx",
    "arguments": "{...}"
  }
}
```
This structure requires the model to have integrated understanding and generation capabilities for this format during the training phase and to support automatic function call triggering. However:
🚫 The DeepSeek-R1-Distill-Qwen series is an open-source large model that does not natively support OpenAI's tool calling protocol or function_call structure.
- ✅ Alternative Solution
Through **instruction tuning**, DeepSeek-R1-Distill-Qwen-14B learns to generate a JSON structure similar to OpenAI's function call, which the program can then parse and invoke. The model is trained to output a pseudo function_call JSON, for example:
```json
{
  "function_call": {
    "name": "newsmind_agent",
    "arguments": "{\"use_agent\": true, \"agent_type\": \"search\", \"keywords\": [\"autonomous vehicles\", \"safety\"]}"
  }
}
```
This format is used as a training target, constructing data samples similar to the following for fine-tuning:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are an intelligent agent scheduling assistant. Please determine whether to call an agent based on the user's question and provide the type and keywords."
    },
    {
      "role": "user",
      "content": "How to ensure the safety of autonomous vehicles?"
    },
    {
      "role": "assistant",
      "function_call": {
        "name": "newsmind_agent",
        "arguments": "{\"use_agent\": true, \"agent_type\": \"search\", \"keywords\": [\"autonomous vehicles\", \"safety\"]}"
      }
    }
  ]
}
```
During the inference phase, the model output will include the above function_call field, which can be parsed through code to achieve the effect of a "pseudo OpenAI function call."
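A minimal sketch of that parsing step (the regex-based extraction and names below are assumptions for illustration, not the project's exact code):

```python
# Extract the pseudo function_call from raw model output and decode its
# arguments so the program can dispatch to the right tool agent.
import json
import re

def parse_function_call(model_output):
    """Pull the first JSON object out of the model output and decode it."""
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if not match:
        return None  # plain chat answer, no agent call
    payload = json.loads(match.group(0))
    call = payload.get("function_call", {})
    args = json.loads(call.get("arguments", "{}"))
    return call.get("name"), args

raw = ('{"function_call": {"name": "newsmind_agent", '
       '"arguments": "{\\"use_agent\\": true, \\"agent_type\\": \\"search\\", '
       '\\"keywords\\": [\\"AI\\"]}"}}')
name, args = parse_function_call(raw)
print(name, args["agent_type"], args["keywords"])
```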
#### 2. Qwen-Plus-14B
In practical applications, when the API retrieves news based on keywords, there often occurs a phenomenon of domain confusion. For example, when searching for news related to "biology," results often include irrelevant content from fields such as AI and economics. This noise greatly affects the accuracy of downstream news aggregation and distribution. Therefore, we focus on **automatic classification and domain identification of news titles** to enhance the overall system's relevance and intelligence.
Implementation Principle:
- 💡 Fine-tune Qwen-Plus-14B to implement a news title classifier.
This project fine-tunes the Qwen-Plus-14B model to enable it to classify news titles by domain (e.g., technology / economy /...). This task is a typical instruction tuning scenario, aiming to have the model determine the domain of the news title based on user input.
- 🎯 Task Format (Instruction Tuning)
The user input (prompt) is a news title, for example:
**"AI chips are reshaping the semiconductor industry."**
The model output (completion) is the corresponding domain label, for example:
**technology**
- Each training sample contains a dialogue history structure, simulating the interaction process between the user and the model, as shown below:
```json
{
  "messages": [
    {
      "role": "user",
      "content": "AI chips are reshaping the semiconductor industry."
    },
    {
      "role": "assistant",
      "content": "technology"
    }
  ]
}
```
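Such samples can be produced mechanically from plain (title, label) pairs; a small sketch whose field layout follows the example above:

```python
# Build chat-format fine-tuning samples from (title, label) pairs,
# one JSON object per line (jsonl).
import json

def to_sample(title, label):
    return {"messages": [
        {"role": "user", "content": title},
        {"role": "assistant", "content": label},
    ]}

pairs = [
    ("AI chips are reshaping the semiconductor industry.", "technology"),
    ("Central bank signals a rate cut.", "economy"),
]
jsonl = "\n".join(json.dumps(to_sample(t, l), ensure_ascii=False) for t, l in pairs)
print(jsonl.splitlines()[0])
```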
## Project Structure 📁
```
├── README.md                     # Description
├── __init__.py
├── configs                       # Configuration directory
│   └── example.env               # Configuration file template
├── log                           # Log directory
│   ├── __init__.py
│   ├── logger.py                 # Logger class
│   └── logs                      # Log output directory
│       └── NewsMind.log          # Log output file
├── main.py                       # Project entry point
├── newsmind_agent.py             # Core agent
├── tool_agents                   # Tool agents
│   ├── __init__.py
│   ├── base_agent.py             # Agent base class
│   ├── integrator                # News aggregation agent
│   │   ├── __init__.py
│   │   ├── agent.py
│   │   └── email_template.html   # Email template
│   ├── mailer                    # Email sending agent
│   │   ├── __init__.py
│   │   └── agent.py
│   ├── retriever                 # RAG agent
│   │   ├── __init__.py
│   │   └── rag_agent.py
│   └── search                    # News search agent
│       ├── __init__.py
│       ├── agent.py
│       ├── domainsConf_generator.py  # News configuration generator
│       ├── news_collector.py     # News collection class
│       └── news_processor.py     # News post-processing class
├── web                           # Web front-end interface
│   └── app.py                    # Web entry point
└── post-training                 # Post-training for DeepSeek and Qwen
```
The `post-training` directory was accidentally deleted and has no backup; I am currently working hard to recover it. 😭
## Environment Configuration 🛠️
### Dependency Installation 📦
```bash
pip install -r requirements.txt
```
### Environment Variable Configuration ⚙️
Copy the environment variable template file:
```bash
cp .env.example .env
```
Configure the contents of the `.env` file:
```
# API Configuration
NEWS_API_KEY=your_news_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
# Email Service Configuration
EMAIL_SENDER=your_email@example.com
EMAIL_PASSWORD=your_email_password
EMAIL_RECEIVER=recipient1@example.com,recipient2@example.com
SMTP_SERVER=smtp.example.com
SMTP_PORT=587
# Scheduling Configuration
SCHEDULE_TIME=07:00
```
Configuration Explanation:
- 🔑 `NEWS_API_KEY`: NewsAPI interface key
- 🔑 `OPENAI_API_KEY`: OpenAI API key for content generation and filtering
- 📨 `EMAIL_SENDER`: Sender's email address
- 🔒 `EMAIL_PASSWORD`: Sender's email password or application-specific password
- 📩 `EMAIL_RECEIVER`: Recipient's email address, separated by commas for multiple addresses
- 🖥️ `SMTP_SERVER`: SMTP server address
- 🔌 `SMTP_PORT`: SMTP server port
- 🕗 `SCHEDULE_TIME`: Scheduled execution time (HH:MM format)
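For illustration, `SCHEDULE_TIME` can be validated and turned into a next-run timestamp with the standard library alone (the project may instead rely on a scheduling library; this helper is an assumption):

```python
# Parse the HH:MM value and compute the next time the daily schedule fires.
from datetime import datetime, timedelta

def next_run(schedule_time, now):
    """Next datetime at which an HH:MM daily schedule fires."""
    hour, minute = map(int, schedule_time.split(":"))
    run = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if run <= now:
        run += timedelta(days=1)  # today's slot already passed
    return run

now = datetime(2025, 4, 15, 8, 30)
print(next_run("07:00", now))  # fires tomorrow at 07:00
```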
## Usage 📋
### Standard Mode 🚀
```bash
python main.py
```
Starts the complete news aggregation process, including LLM keyword generation, news retrieval, content integration, and email distribution.
### Preset Keyword Mode 🔍
```bash
python main.py --hard=true
```
Uses the system's preset keyword combinations for searching, outputting detailed retrieval results for debugging and verification. This mode still executes the complete content integration and email distribution process.
### Email Test Mode 📧
```bash
python main.py --sent=false
```
Only executes the news collection and content integration process without sending emails, suitable for development and testing phases.
Parameter Explanation:
- 🔄 `--hard=true`: Enables preset keyword combinations and outputs detailed retrieval results
- 📧 `--sent=false`: Disables email sending functionality, only executing news collection and content integration
## Advanced Configuration 🔧
### Keyword Customization 🔤
You can customize the preset keyword library for each domain by modifying the `agents/keywords.json` file. The system will generate search query combinations based on this.
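As a toy illustration of how per-domain keywords can be expanded into search query combinations (the dictionary below is a stand-in for `keywords.json`; its exact schema is an assumption):

```python
# Expand per-domain, per-language keywords into "kw1 AND kw2" query strings.
# The dict stands in for keywords.json; the real schema may differ.
import itertools

keyword_conf = {
    "technology": {"en": ["AI", "semiconductor"], "zh": ["人工智能", "芯片"]},
    "economy": {"en": ["inflation"], "zh": ["通胀"]},
}

def build_queries(conf, domain, lang, max_queries=3):
    """Pair up a domain's keywords into boolean search queries."""
    words = conf[domain][lang]
    if len(words) < 2:
        return words[:max_queries]  # a single keyword is used as-is
    pairs = itertools.combinations(words, 2)
    return [" AND ".join(p) for p in itertools.islice(pairs, max_queries)]

print(build_queries(keyword_conf, "technology", "en"))  # ['AI AND semiconductor']
```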
### Domain Prompt Configuration 🎯
The `domain_prompts` dictionary in `search_agent.py` defines the article filtering criteria for each domain, which can be adjusted according to needs.
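A hypothetical shape for that dictionary (the actual keys and wording in `search_agent.py` may differ):

```python
# Assumed structure of domain_prompts: one filtering instruction per domain,
# composed with an article title into a yes/no relevance prompt for the LLM.
domain_prompts = {
    "technology": "Keep only articles about software, hardware, AI, or chips.",
    "economy": "Keep only articles about markets, policy, or macroeconomics.",
}

def build_filter_prompt(domain, title):
    """Compose the per-domain filtering instruction for one article title."""
    return f"{domain_prompts[domain]}\nTitle: {title}\nRelevant? Answer yes or no."

print(build_filter_prompt("technology", "New GPU architecture announced"))
```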
## Troubleshooting ❓
API Key Errors:
- Check the configuration in the `.env` file and the status of environment variable loading.
Search Result Quality Issues:
- Run with the `--hard=true` parameter for result verification.
- Review and optimize the keyword configuration in `keywords.json`.
- Adjust the domain filtering parameters in `search_agent.py`.
## Advanced Section: Large Model Fine-tuning and Title Classification Accuracy Improvement 🧬
### Background and Challenges
As noted in the fine-tuning section above, when NewsAPI retrieves news by keyword it often suffers from domain confusion: a search for "biology" news frequently returns unrelated content from fields such as AI and economics, and this noise degrades the accuracy of downstream aggregation and distribution. We therefore focus on **automatic classification and domain identification of news titles** to improve the overall system's relevance and intelligence.
### Technical Route and Implementation Principle
This section builds a high-precision news title classifier on the **DeepSeek-R1-Distill-Qwen-14B** large model using LoRA parameter-efficient fine-tuning. The overall workflow is described below.
### Experimental Platform
The hardware environment of the experimental platform for this project is as follows:
- The server is equipped with 8 NVIDIA A800 80 GB GPUs.
- Only 2 of the A800 GPUs were used for distributed fine-tuning in this experiment.
1. **Data Collection and Preprocessing**
- Use `utils/news_fetch.py` to automatically scrape news titles from multiple domains and store them in the `dataset/` directory.
- Through `utils/news_title_classifier.py`, call the self-built GPT-4o-mini gateway to label the collected titles with domain tags, forming a high-quality labeled dataset.
- Use `utils/prepare_fine_tuning_data.py` to convert the labeled data into a format suitable for fine-tuning the large model (e.g., jsonl) for subsequent training.
2. **Large Model Fine-tuning**
- Single-GPU fine-tuning: use `finetune_deepseek.py` to efficiently fine-tune DeepSeek-R1-Distill-Qwen-14B with LoRA.
- Multi-GPU distributed fine-tuning: Use `finetune_deepseek_gpus.py` to support distributed training in a multi-GPU environment, accelerating large-scale data processing.
- The fine-tuning data comes from the high-quality labeled set in the `dataset/` directory.
3. **API Service and Testing**
- `api_server.py`: Provides a local API service for the original DeepSeek-R1-Distill-Qwen-14B model, facilitating inference and comparative testing.
- `api_server_ft.py`: Deploys the fine-tuned model API service, supporting external calls and evaluations.
- `curl.py`: Contains a series of API testing scripts to verify the availability and response accuracy of the gateway service.
4. **Effect Evaluation and Comparison**
- `extract_inference_dataset.py`: Extracts titles and standard answers from the labeled dataset to construct the evaluation set.
- `inference_evaluation.py` and `inference_evaluation_ft.py`: Conduct inference tests on the original model and the fine-tuned model, respectively, to calculate their title classification accuracy and quantify the fine-tuning effect.
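The accuracy comparison these scripts perform reduces to a simple computation; a self-contained sketch with made-up prediction lists:

```python
# Compare each model's predicted labels against the gold labels and report
# accuracy. The prediction lists are hypothetical, not real evaluation output.
def accuracy(predictions, gold):
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

gold = ["technology", "economy", "sports", "technology"]
base = ["technology", "technology", "sports", "economy"]  # original model (hypothetical)
ft = ["technology", "economy", "sports", "technology"]    # fine-tuned model (hypothetical)
print(f"base={accuracy(base, gold):.2f} ft={accuracy(ft, gold):.2f}")
```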
### Directory Structure
```
post-training
├── api_server_ft.py              # API service for the fine-tuned model
├── api_server.py                 # API service for the original model
├── curl.py                       # API test script to verify gateway availability
├── dataset                       # Dataset directory
│   ├── inference.json            # Data for inference evaluation
│   ├── inference_train.json      # Data for inference training
│   ├── news_titles_20250414.txt  # Scraped raw news titles
│   ├── processed_titles_20250414.json  # Processed labeled titles
│   ├── train.jsonl               # Fine-tuning training set
│   └── val.jsonl                 # Fine-tuning validation set
├── extract_inference_dataset.py  # Builds the inference evaluation set
├── finetune_deepseek_gpus.py     # Multi-GPU distributed fine-tuning script
├── finetune_deepseek.py          # Single-GPU fine-tuning script
├── inference_evaluation_ft.py    # Inference evaluation for the fine-tuned model
├── inference_evaluation.py       # Inference evaluation for the original model
├── logs
│   └── imgs                      # Images from training/evaluation
├── utils
│   ├── news_fetch.py             # News title scraping script
│   ├── news_title_classifier.py  # Title auto-labeling script
│   └── prepare_fine_tuning_data.py  # Converts labeled data to fine-tuning format
└── wandb                         # Training logs and visualization
    ├── latest-run -> run-20250415_010644-4ybs94x1
    ├── run-20250414_190944-ot0jrwum
    ├── run-20250414_191602-hswcei9x
    └── run-20250415_010644-4ybs94x1
```
### CookTricks
#### Intelligent Label Generation and Category Balancing Mechanism
This project introduces an intelligent label generation and category balancing mechanism during the data preprocessing stage. `news_title_classifier.py` not only implements automated domain identification for news titles but also dynamically generates supplementary samples based on the current category distribution, effectively alleviating the issue of category imbalance. This mechanism enhances the representativeness and fairness of the training data, significantly improving the model's generalization ability on minority classes. The final generated `processed_titles_20250414.json` file contains both real collected titles and AI-generated balanced supplementary samples, distinguished by the `is_generated` field for flexible filtering and analysis later.
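The balancing idea can be sketched as follows (field names follow the description above; the actual generation of supplementary samples is left out):

```python
# Count labels and compute how many supplementary samples each minority class
# needs to reach the size of the largest class.
from collections import Counter

def balance_plan(samples, target=None):
    counts = Counter(s["label"] for s in samples)
    target = target or max(counts.values())
    return {label: target - n for label, n in counts.items() if n < target}

samples = (
    [{"title": f"t{i}", "label": "technology", "is_generated": False} for i in range(5)]
    + [{"title": f"e{i}", "label": "economy", "is_generated": False} for i in range(2)]
)
print(balance_plan(samples))  # {'economy': 3}
```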
#### Distributed Parallel Training Architecture
The model training section adopts a distributed parallel training architecture, supporting efficient fine-tuning in a single machine with multiple GPUs. By using the `finetune_deepseek_gpus.py` script and PyTorch DDP (Distributed Data Parallel) mechanism, users can fully utilize multi-GPU resources with a single command, significantly shortening the training cycle. This architecture not only enhances the efficiency of the current experiment but also provides a solid engineering foundation for future large-scale, multi-task, multi-domain training of large models.
```bash
torchrun --nproc_per_node=2 autonews-agent/post-training/finetune_deepseek_gpus.py
```
#### LoRA Parameter Efficient Fine-tuning Method
This project employs the LoRA (Low-Rank Adaptation) parameter efficient fine-tuning method during the fine-tuning phase. LoRA introduces low-rank decomposition to part of the weight matrix, training only a minimal number of learnable parameters, greatly reducing memory and computational consumption. Its core idea can be expressed as:
$$
W = W_0 + \Delta W = W_0 + BA
$$
where
$$
A \in \mathbb{R}^{r \times d}, \quad B \in \mathbb{R}^{d \times r}, \quad r \ll d.
$$
By training only the low-rank matrices A and B, efficient adaptation of the large model can be achieved. The LoRA method is suitable for fine-tuning large models in resource-constrained environments and naturally fits distributed parallel training scenarios, facilitating large-scale application deployment.
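A quick back-of-the-envelope check of the parameter saving (the hidden size 4096 below is an assumed example, not the model's exact dimension):

```python
# For a d x d weight with rank-r LoRA factors B (d x r) and A (r x d),
# only d*r + r*d parameters are trained instead of d*d.
d, r = 4096, 8  # assumed example dimensions
full_params = d * d
lora_params = d * r + r * d
print(f"full={full_params:,} lora={lora_params:,} ratio={lora_params / full_params:.4%}")
# lora is roughly 0.39% of the full matrix's parameters
```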
### DevFlow
#### 1. Screenshot after running news_title_classifier.py

#### 2. Screenshot of running the LoRA fine-tuning code finetune_deepseek (including parameter information)

#### 3. Screenshot of starting dual GPU A800 training with finetune_deepseek_gpus.py

#### 4. Screenshot of code results after 4 hours of dual GPU training

#### 5. Configuration information for this training run provided by Weights & Biases

#### 6. Visualization of loss and dynamic learning rate during training (loss dropped from 10.7 to 0.11)

#### 7. During the eval process, loss dropped from 0.24 to 0.1, results met expectations and aligned with theoretical predictions

#### 8. Screenshot of evaluating the fine-tuned model against the original model with inference_evaluation.py (2 workers, dual-threaded, API inference)

#### 9. Screenshot of running inference_evaluation.py (loading LoRA, merging weights, direct inference)

## Further Outlook: Multi-Agent Collaboration 🤖🤝🤖
### MCP + A2A Collaboration Paradigm 🛠️🔗
In the world of agents, going solo is outdated! The trend now is—
- **MCP (Model Context Protocol) 🛠️**
It's like giving all agents a "tool manual" 📚, so everyone uses tools according to a unified standard. Whether it's looking up information, doing math, or sending emails, agents can flexibly call various external tools just like humans, and whoever uses them well becomes the "tool king" 👑! The essence of MCP is: **Let agents use tools; the more tools, the stronger!**
- **A2A (Agent-to-Agent) 🤝**
You have your skills, I have my expertise. A2A allows agents to chat, collaborate, and share information, completing large tasks like a team. When faced with challenges, everyone brainstorms together, doubling efficiency!
- **MCP + A2A Fusion Paradigm 🚀**
With tools used skillfully and partners cooperating well, the agent team can achieve anything! MCP + A2A combines unified tool standards and efficient collaboration between agents, suitable for creating super-large-scale, highly capable agent "alliances" ⚡.
#### MCP

#### A2A

#### MCP + A2A

---
### Agent SDK 🧩✨
To let agents "team up to tackle challenges," a good "development toolkit" is essential—this is the **Agent SDK**!
- **Core Idea 🧠**
The Agent SDK acts like a "universal remote control" 🎮 for agents, helping you standardize and modularize each agent. Developers only need to write business logic, while the SDK automatically manages communication, scheduling, and tool invocation, making it easy and efficient!
- **Main Functions 🛎️**
- Agent registration and discovery: Agents become "visible" upon arrival, with capabilities clear at a glance 👀.
- Tool integration: One line of code connects external APIs, databases, and search engines, making tool usage so easy for agents 🔌.
- Message routing and task orchestration: Messages pass quickly between agents, and task distribution is orderly 📬.
- Lifecycle and exception management: Agent issues? The SDK has your back, stable as a rock 🦺.
- **Application Value 💡**
With the Agent SDK, the development efficiency of multi-agent systems goes up, maintenance and expansion become easy, and "mass production" of agent applications is no longer a dream!
#### Core Process of Agent SDK (This image is generated based on ChatGPT's native image functionality)

#### Agent SDK demo example (using OpenAI Agent SDK as an example)

#### Special Thanks: Thanks to OpenAI for providing free `GPT-4.1-mini` model credits
---
## License 📄
[MIT License](LICENSE)