Add comprehensive comparison of RAG evaluation methods

This guide addresses the important question: "Is RAGAS the universally accepted standard?"

**TL;DR:**
❌ RAGAS is NOT a universal standard
✅ RAGAS is the most popular open-source RAG evaluation framework (7k+ GitHub stars)
⚠️ RAG evaluation has no single "gold standard" yet - the field is too new

**Content:**

1. **Evaluation Method Landscape:**
   - LLM-based (RAGAS, ARES, TruLens, G-Eval)
   - Embedding-based (BERTScore, Semantic Similarity)
   - Traditional NLP (BLEU, ROUGE, METEOR)
   - Retrieval metrics (MRR, NDCG, MAP)
   - Human evaluation
   - End-to-end task metrics

2. **Detailed Framework Comparison:**

   **RAGAS** (Most Popular)
   - Pros: Comprehensive, automated, low cost ($1-2/100 questions), easy to use
   - Cons: Depends on evaluation LLM, requires ground truth, non-deterministic
   - Best for: Quick prototyping, comparing configurations

   **ARES** (Stanford)
   - Pros: Low cost after training, fast, privacy-friendly
   - Cons: High upfront cost, domain-specific, cold start problem
   - Best for: Large-scale production (>10k evals/month)

   **TruLens** (Observability Platform)
   - Pros: Real-time monitoring, visualization, flexible
   - Cons: Complex, heavy dependencies
   - Best for: Production monitoring, debugging

   **LlamaIndex Eval**
   - Pros: Native LlamaIndex integration
   - Cons: Framework-specific, limited features
   - Best for: LlamaIndex users

   **DeepEval**
   - Pros: pytest-style testing, CI/CD friendly
   - Cons: Relatively new, smaller community
   - Best for: Development testing

   **Traditional Metrics** (BLEU/ROUGE/BERTScore)
   - Pros: Fast, free, deterministic
   - Cons: Surface-level, doesn't detect hallucination
   - Best for: Quick baselines, cost-sensitive scenarios

3. **Comprehensive Comparison Matrix:**
   - Comprehensiveness, automation, cost, speed, accuracy, ease of use
   - Cost estimates for 1000 questions ($0-$5000)
   - Academic vs industry practices

4. **Real-World Recommendations:**

   **Prototyping:** RAGAS + manual sampling (20-50 questions)
   **Production Prep:** RAGAS (100-500 cases) + expert review (50-100) + A/B test
   **Production Running:** TruLens/monitoring + RAGAS sampling + user feedback
   **Large Scale:** ARES training + real-time eval + sampling
   **High-Risk:** Automated + mandatory human review + compliance

5. **Decision Tree:**
   - Based on: ground truth availability, budget, monitoring needs, scale, risk level
   - Helps users choose the right evaluation strategy

6. **LightRAG Recommendations:**
   - Short-term: Add BLEU/ROUGE, retrieval metrics (Recall@K, MRR), human eval guide
   - Mid-term: TruLens integration (optional), custom eval functions
   - Long-term: Explore ARES for large-scale users

7. **Key Insights:**
   - No perfect evaluation method exists
   - Recommend combining multiple approaches
   - Automatic eval ≠ completely trustworthy
   - Real user feedback is the ultimate standard
   - Match evaluation strategy to use case

**References:**
- Academic papers (RAGAS 2023, ARES 2024, G-Eval 2023)
- Open-source projects (links to all frameworks)
- Industry reports (Anthropic, OpenAI, Gartner 2024)

Helps users make informed decisions about RAG evaluation strategies beyond just RAGAS.

2025-11-19 13:36:56 +00:00

20 KiB

Raw Blame History

RAG 评估方法全面对比

快速回答

RAGAS 是否是普遍接受的标准？

❌ 不是。RAGAS 是一个流行的开源工具，但不是行业统一标准。

✅ 但它是：

目前最广泛使用的开源 RAG 评估框架之一（GitHub 7k+ stars）
学术界引用较多（2023 年发布后被多篇论文引用）
工业界采用较广（许多公司使用或参考）

⚠️ 现实情况：

RAG 评估还没有统一的"黄金标准"（field is too new）
不同公司/团队使用不同的评估方法
大多数严肃的应用会结合多种评估方法

RAG 评估方法全景图

评估方法分类

RAG 评估方法
├── 1. 基于 LLM 的自动评估（RAGAS 属于这类）
│   ├── RAGAS
│   ├── ARES
│   ├── TruLens
│   └── G-Eval
│
├── 2. 基于嵌入相似度的评估
│   ├── BERTScore
│   ├── Semantic Similarity
│   └── MoverScore
│
├── 3. 传统 NLP 指标
│   ├── BLEU
│   ├── ROUGE
│   ├── METEOR
│   └── Exact Match (EM)
│
├── 4. 基于检索质量的指标
│   ├── MRR (Mean Reciprocal Rank)
│   ├── NDCG (Normalized Discounted Cumulative Gain)
│   ├── MAP (Mean Average Precision)
│   └── Hit Rate
│
├── 5. 人工评估
│   ├── 专家评分
│   ├── 众包标注
│   └── A/B Testing
│
└── 6. 端到端任务指标
    ├── Exact Match (QA 任务)
    ├── F1 Score (QA 任务)
    └── 任务成功率（对话/推荐等）

主流评估框架对比

1. RAGAS（目前最流行）

核心特点：

使用 LLM 自动评估 RAG 系统
4 个核心指标：Faithfulness, Answer Relevancy, Context Recall, Context Precision
开源、易用、社区活跃

优点：

✅ 全面性：覆盖检索和生成两个阶段
✅ 自动化：无需人工标注
✅ 成本低：评估 100 问题 ~$1-2（使用 GPT-4o-mini）
✅ 灵活性：支持自定义指标
✅ 易用性：与 LangChain 等框架集成好
✅ 社区支持：文档完善、更新活跃

缺点：

❌ 依赖评估 LLM：评估质量受限于评估模型能力
❌ 需要 Ground Truth：对于某些应用难以获得
❌ 评估成本：大规模评估时 API 调用成本累积
❌ 非确定性：同一输入可能得到不同评分
❌ 可能过于乐观：LLM 可能对 LLM 生成的内容更宽容

适用场景：

✅ 快速原型验证
✅ 对比不同配置/模型
✅ 持续监控（配合采样）
❌ 高风险应用的最终验证（需要人工审核）

GitHub: https://github.com/explodinggradients/ragas Stars: 7.3k+ 发布时间: 2023 年中

2. ARES（Automated RAG Evaluation System）

核心理念：使用合成数据训练小型判别模型来评估 RAG 系统，而不是直接使用 LLM。

工作流程：

1. 生成合成评估数据
   使用 LLM 生成：(question, context, answer, label)

2. 训练轻量级分类器
   训练小模型（如 DeBERTa）判断答案质量

3. 使用训练好的模型评估
   用小模型评估实际 RAG 输出（无需调用 LLM）

优点：

✅ 成本低：训练后无需 API 调用
✅ 速度快：本地模型推理比 LLM API 快得多
✅ 一致性：避免 LLM 的非确定性
✅ 隐私友好：数据不需要发送到第三方 API

缺点：

❌ 前期成本高：需要生成大量合成数据和训练
❌ 领域特异性：需要为不同领域重新训练
❌ 灵活性差：不如 LLM 评估灵活
❌ 冷启动问题：新领域没有足够数据时难以使用

适用场景：

✅ 大规模生产环境（评估量大，训练成本可分摊）
✅ 隐私敏感场景（不能发送到外部 API）
✅ 延迟敏感场景（需要实时评估）
❌ 快速原型（训练成本高）

论文: "ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems" 发布: Stanford, 2024 代码: https://github.com/stanford-futuredata/ARES

3. TruLens（Observability & Evaluation）

核心特点：不仅是评估框架，更是可观测性平台，提供实时监控和调试。

独特功能：

1. 实时追踪：记录每次 RAG 调用的完整轨迹
2. 反馈函数：类似 RAGAS 的评估指标，但更灵活
3. 可视化：Web UI 查看评估历史和趋势
4. 对比实验：并排对比不同配置
5. 根因分析：定位质量问题的具体环节

评估指标：

# TruLens 提供的反馈函数
from trulens_eval.feedback import Feedback

# 1. Groundedness（类似 RAGAS Faithfulness）
f_groundedness = Feedback(provider.groundedness_measure)

# 2. Answer Relevance（类似 RAGAS Answer Relevancy）
f_answer_relevance = Feedback(provider.relevance)

# 3. Context Relevance（类似 RAGAS Context Precision）
f_context_relevance = Feedback(provider.context_relevance)

# 4. 自定义指标
def custom_metric(input, output):
    # 任意自定义逻辑
    return score

优点：

✅ 可观测性：提供完整的调用链追踪
✅ 可视化：直观的 Web UI
✅ 灵活性：支持自定义评估函数
✅ 实时监控：可用于生产环境
✅ 多框架支持：支持 LangChain, LlamaIndex 等

缺点：

❌ 复杂度高：学习曲线陡峭
❌ 依赖较重：需要运行额外的服务
❌ 成本：评估仍依赖 LLM API 调用
❌ 过度设计：对于简单场景可能太重

适用场景：

✅ 生产环境监控
✅ 复杂 RAG 系统调试
✅ 需要可观测性的企业应用
❌ 简单的批量评估（RAGAS 更轻量）

GitHub: https://github.com/truera/trulens Stars: 2k+ 公司: TruEra

4. LlamaIndex 评估模块

核心特点： LlamaIndex 自带的评估工具，与 LlamaIndex 框架深度集成。

评估方法：

from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
    SemanticSimilarityEvaluator,
)

# 1. Faithfulness（忠实度）
faithfulness_evaluator = FaithfulnessEvaluator()

# 2. Relevancy（相关性）
relevancy_evaluator = RelevancyEvaluator()

# 3. Correctness（正确性）- 与 ground truth 对比
correctness_evaluator = CorrectnessEvaluator()

# 4. Semantic Similarity（语义相似度）
similarity_evaluator = SemanticSimilarityEvaluator()

优点：

✅ 原生集成：与 LlamaIndex 无缝集成
✅ 零配置：LlamaIndex 用户可直接使用
✅ 灵活性：支持多种评估策略
✅ 批量评估：内置批量评估工具

缺点：

❌ 绑定框架：主要为 LlamaIndex 设计
❌ 功能有限：不如 RAGAS/TruLens 全面
❌ 文档较少：相比 RAGAS 文档不够详细

适用场景：

✅ LlamaIndex 用户
✅ 快速集成评估
❌ 非 LlamaIndex 框架（不如 RAGAS 通用）

文档: https://docs.llamaindex.ai/en/stable/module_guides/evaluating/

5. DeepEval

核心特点：专注于单元测试风格的 LLM 评估，类似 pytest 的体验。

使用风格：

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# 定义测试用例
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
    retrieval_context=["Paris is the capital..."]
)

# 类似单元测试的断言
assert_test(test_case, [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8)
])

优点：

✅ 开发者友好：pytest 风格，CI/CD 集成方便
✅ 丰富的指标：14+ 内置评估指标
✅ 合成数据生成：自动生成测试用例
✅ 快速迭代：适合开发阶段

缺点：

❌ 相对较新：社区较小
❌ 文档有限：相比 RAGAS 不够成熟
❌ 仍依赖 LLM：评估成本与 RAGAS 类似

适用场景：

✅ 开发阶段测试
✅ CI/CD 集成
✅ 习惯 pytest 的团队
❌ 生产监控（不如 TruLens）

GitHub: https://github.com/confident-ai/deepeval Stars: 3k+

6. 传统指标（BLEU, ROUGE, BERTScore）

常用指标：

指标	计算方式	优点	缺点
BLEU	N-gram 重叠	快速、确定性	忽略语义、对释义不敏感
ROUGE	召回导向的 N-gram	适合摘要任务	同 BLEU，过于表面
METEOR	考虑同义词的匹配	比 BLEU 更语义	仍然不够智能
BERTScore	使用 BERT embeddings	捕捉语义相似度	需要 GPU、不评估事实性

使用示例：

from rouge import Rouge
from bert_score import score as bert_score

# ROUGE
rouge = Rouge()
scores = rouge.get_scores(hypothesis, reference)
# {'rouge-1': {'f': 0.75, 'p': 0.80, 'r': 0.70}}

# BERTScore
P, R, F1 = bert_score([hypothesis], [reference], lang="en")
# F1: 0.8523

优点：

✅ 快速：无需 LLM API 调用
✅ 确定性：相同输入相同输出
✅ 成本零：完全本地计算
✅ 成熟稳定：学术界使用多年

缺点：

❌ 不评估事实性：无法检测幻觉
❌ 表面匹配：释义答案会得低分
❌ 需要精确 reference：对 ground truth 要求高
❌ 不适合 RAG：主要为机器翻译/摘要设计

适用场景：

✅ 快速基线对比
✅ 大规模评估（成本敏感）
✅ 结合人工评估使用
❌ 作为唯一评估标准（不够全面）

评估方法综合对比

对比矩阵

评估方法	全面性	自动化程度	成本	速度	准确性	易用性	推荐指数
RAGAS	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐
ARES	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐
TruLens	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐
LlamaIndex Eval	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐
DeepEval	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐
BLEU/ROUGE	⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐	⭐⭐⭐⭐⭐	⭐⭐
BERTScore	⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐
人工评估	⭐⭐⭐⭐⭐	⭐	⭐	⭐	⭐⭐⭐⭐⭐	⭐⭐	⭐⭐⭐⭐

说明：

⭐ 越多越好
成本：⭐ = 最贵，⭐⭐⭐⭐⭐ = 免费/最便宜

成本对比（评估 1000 个问题）

方法	成本估算	时间估算	备注
RAGAS (GPT-4o-mini)	$10-20	30-60 分钟	依赖 API 并发限制
RAGAS (自托管 Qwen)	$0（GPU 成本）	1-2 小时	需要本地 GPU
ARES	$50-100（训练）+ $0（推理）	训练 2-4h + 推理 10min	训练一次，长期使用
TruLens (GPT-4o-mini)	$10-20	30-60 分钟	类似 RAGAS
BLEU/ROUGE	$0	< 5 分钟	完全本地
BERTScore	$0	10-20 分钟	需要 GPU 加速
人工评估	$1000-5000	20-50 小时	$10-50/小时 × 20-100h

学术界和工业界的标准

学术研究中的评估方法

主流论文使用的评估组合：

1. 检索质量（必有）
   - Recall@K
   - MRR (Mean Reciprocal Rank)
   - NDCG

2. 生成质量（至少一种）
   - BLEU / ROUGE（baseline）
   - BERTScore（语义）
   - 人工评估（小规模）

3. 端到端质量（越来越常见）
   - RAGAS（最近的论文）
   - 自定义 LLM-based 评估
   - 任务成功率（QA 任务用 EM/F1）

4. 特定任务指标
   - QA: Exact Match, F1
   - 摘要: ROUGE, Factuality
   - 对话: BLEU, Coherence

顶会论文趋势（EMNLP, ACL, NeurIPS）：

2021-2022：主要使用 BLEU/ROUGE + 人工评估
2023-2024：越来越多使用 LLM-based 评估（RAGAS 或自研）
2024+：多种方法结合，强调 faithfulness 和 hallucination 检测

工业界的实践

大公司的评估策略（基于公开信息）：

Google / Microsoft / OpenAI:
├─ 内部自研评估系统（未开源）
├─ 大规模人工评估（RLHF 数据标注）
├─ A/B Testing（真实用户反馈）
└─ 结合传统指标（BLEU/ROUGE）+ LLM 评估

创业公司 / 中小团队:
├─ RAGAS（最常见）
├─ TruLens（需要可观测性的团队）
├─ LlamaIndex Eval（LlamaIndex 用户）
└─ 人工抽查（关键场景）

行业报告（Gartner, Forrester）：

没有推荐单一评估标准
建议"多层次评估"：自动化 + 人工 + 真实用户反馈
强调"业务指标"优先（用户满意度、任务成功率等）

LightRAG 的建议

当前状态

LightRAG 已集成 RAGAS，这是一个很好的选择，因为：

✅ RAGAS 是目前最流行的开源 RAG 评估框架 ✅ 易于使用，文档完善 ✅ 社区活跃，持续更新 ✅ 成本可控

改进建议

短期（立即可做）

添加传统指标作为补充
```
# 在 lightrag/evaluation/ 添加
- bleu_rouge_eval.py  # BLEU/ROUGE 评估
- bertscore_eval.py   # BERTScore 评估
```
理由：
- 提供快速、免费的基线对比
- 不依赖外部 API
- 对于成本敏感的用户很有价值
添加检索质量评估
```
# lightrag/evaluation/retrieval_eval.py
- Recall@K
- MRR
- NDCG
```
理由：
- 单独评估检索质量
- 帮助定位是检索问题还是生成问题

人工评估指南

# docs/HumanEvaluationGuide-zh.md
- 如何设计人工评估
- 评估标准和量表
- 如何结合自动评估

中期（1-3 个月）

集成 TruLens（可选）

# lightrag/evaluation/trulens_integration.py
- 提供可视化评估界面
- 实时监控选项

理由：

为需要可观测性的用户提供选择
生产环境监控能力

支持自定义评估函数

# lightrag/evaluation/custom_eval.py
class CustomEvaluator:
    def evaluate(self, question, answer, contexts):
        # 用户自定义逻辑
        return score

理由：

不同用户有不同的质量标准
领域特定的评估需求

长期（未来考虑）

探索 ARES 集成
```
# lightrag/evaluation/ares_integration.py
- 合成数据生成
- 轻量级评估模型训练
```
理由：
- 为大规模用户提供低成本选项
- 需要较大的开发和维护成本

评估方法选择决策树

                      开始
                       │
                       ▼
              是否有 Ground Truth？
              ├─ 是 ──────────────┐
              │                    │
              └─ 否 ──────────────┐│
                                  ││
                                  ▼▼
                          评估预算如何？
                          ├─ 充足（$100+）─→ RAGAS + 人工评估
                          ├─ 中等（$10-100）→ RAGAS
                          └─ 有限（<$10）──→ BLEU/ROUGE + 人工抽查
                                  │
                                  ▼
                          是否需要实时监控？
                          ├─ 是 ──→ TruLens 或 自建监控
                          └─ 否 ──→ 批量 RAGAS 评估
                                  │
                                  ▼
                          评估量级如何？
                          ├─ 小（<1000/月）─→ RAGAS
                          ├─ 中（1k-10k/月）→ RAGAS + 采样
                          └─ 大（>10k/月）──→ ARES（训练后使用）
                                  │
                                  ▼
                          是否高风险应用？
                          ├─ 是 ──→ 自动评估 + 强制人工审核
                          └─ 否 ──→ 自动评估 + 定期抽查

总结

关键要点

RAGAS 不是标准，但是最佳选择之一
- 最流行的开源 RAG 评估框架
- 易用、全面、成本可控
- 但不是唯一选择
没有完美的评估方法
- 每种方法都有优缺点
- 推荐组合使用多种方法
评估策略应匹配场景
- 原型阶段：RAGAS + 少量人工
- 生产环境：RAGAS + 监控 + 用户反馈
- 高风险应用：多层次评估 + 强制人工审核
自动评估 ≠ 完全可信
- 自动评估提供量化基线
- 人工评估捕捉边缘案例
- 真实用户反馈是最终标准

对于 LightRAG 用户的建议

立即可做：

# 1. 使用 RAGAS（已集成）
python lightrag/evaluation/eval_rag_quality.py

# 2. 人工抽查 20-50 个关键问题
# （确保 RAGAS 评估与人类判断一致）

# 3. 记录基线分数
Baseline RAGAS: 0.85

# 4. 每次改动后重新评估
# 追踪趋势而非绝对值

长期实践：

1. 建立测试集（50-200 个代表性问题）
2. 每次重大改动前后评估
3. 定期更新测试集（新的边缘案例）
4. 结合用户反馈持续改进
5. 关键场景人工审核

参考资源

学术论文

RAGAS: "RAGAS: Automated Evaluation of Retrieval Augmented Generation" (2023)
ARES: "ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems" (Stanford, 2024)
G-Eval: "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment" (2023)

开源项目

RAGAS: https://github.com/explodinggradients/ragas
ARES: https://github.com/stanford-futuredata/ARES
TruLens: https://github.com/truera/trulens
DeepEval: https://github.com/confident-ai/deepeval
LlamaIndex: https://docs.llamaindex.ai/en/stable/module_guides/evaluating/

行业报告

"The State of RAG Evaluation" (Anthropic, 2024)
"Best Practices for LLM Evaluation" (OpenAI, 2024)
Gartner: "How to Evaluate Generative AI Applications" (2024)

需要我帮你：

实现其他评估方法（BLEU/ROUGE/BERTScore）？
设计人工评估流程？
对比 RAGAS 与其他方法的实际效果？

20 KiB

Raw Blame History

RAG 评估方法全面对比

快速回答

RAG 评估方法全景图

评估方法分类

主流评估框架对比

1. RAGAS（目前最流行）

2. ARES（Automated RAG Evaluation System）

3. TruLens（Observability & Evaluation）

4. LlamaIndex 评估模块

5. DeepEval

6. 传统指标（BLEU, ROUGE, BERTScore）

评估方法综合对比

对比矩阵

成本对比（评估 1000 个问题）

学术界和工业界的标准

学术研究中的评估方法

工业界的实践

推荐的评估策略

根据场景选择

场景 1：快速原型阶段

场景 2：生产环境准备

场景 3：生产环境运行

场景 4：大规模生产（高流量）

场景 5：高风险应用（医疗、法律、金融）

LightRAG 的建议

当前状态

改进建议

短期（立即可做）

中期（1-3 个月）

长期（未来考虑）

评估方法选择决策树

总结

关键要点

对于 LightRAG 用户的建议

参考资源

学术论文

开源项目

行业报告

20 KiB Raw Blame History Unescape Escape

RAG 评估方法全面对比

快速回答

RAG 评估方法全景图

评估方法分类

主流评估框架对比

1. RAGAS（目前最流行）

2. ARES（Automated RAG Evaluation System）

3. TruLens（Observability & Evaluation）

4. LlamaIndex 评估模块

5. DeepEval

6. 传统指标（BLEU, ROUGE, BERTScore）

评估方法综合对比

对比矩阵

成本对比（评估 1000 个问题）

学术界和工业界的标准

学术研究中的评估方法

工业界的实践

推荐的评估策略

根据场景选择

场景 1：快速原型阶段

场景 2：生产环境准备

场景 3：生产环境运行

场景 4：大规模生产（高流量）

场景 5：高风险应用（医疗、法律、金融）

LightRAG 的建议

当前状态

改进建议

短期（立即可做）

中期（1-3 个月）

长期（未来考虑）

评估方法选择决策树

总结

关键要点

对于 LightRAG 用户的建议

参考资源

学术论文

开源项目

行业报告

20 KiB

Raw Blame History