FinClaw Evaluation Case Structure

Status: Draft / structured evaluation case fields
Date: 2026-05-13
Role: FinClaw Program Controller

1. Purpose

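This document is a structural draft for upgrading the evaluation case library under evaluation/finclaw/ from a single Markdown document to three layers: a human-readable specification, machine-readable cases, and unified run results.

This batch only defines the field structure. It does not create a standalone tool repository, run tests, or change FinClaw engineering code.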

2. Design Rules

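A Case ID must directly express the evaluation dimension or matrix axis and must not use internal abbreviations.
A Case ID should be directly understandable by a human reader, for example: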
- Access-Baseline-00
- Cognition-Matrix-01
- Real-Chat-01
- Report-Pipeline-01
- Benchmark-Financial-Text-01
- Multimodal-Chart-01
- Safety-AUTH-01
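case-library.md remains the primary specification and explanation layer.
Structured case files carry only executable, aggregatable fields.
A run result file records exactly one actual run and never rewrites the case definition.
The benchmark evaluation / safety adapter layer does not expand the FinClaw first-version product boundary.
First-version product objects, thread lifecycle, advisor responsibilities, and responsibility boundaries are governed by the formal main-chain documents under projects/finclaw/; structured cases may only inherit these definitions and must not expand product commitments in reverse.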

3. Suggested Directory Shape

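For the first phase, the cases should stay in the ecosystem-level evaluation area of the Labs-FinTecAI knowledge base, confined to the FinClaw namespace: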

evaluation/
├── README.md
├── finclaw/
│   ├── README.md
│   ├── case-library.md
│   ├── case-schema.md
│   ├── cases/
│   ├── runs/
│   └── reports/
├── shared/
└── future/

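For now, case-library.md is not placed directly in the evaluation/ root because it does not yet cover the whole ecosystem. Only cases validated through cross-project reuse should later move up to evaluation/shared/.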

4. Case Definition Schema

YAML is the suggested format. The fields are as follows:

case_id: Cognition-Matrix-01
title: Macro Regime Shock
case_family: Cognition Matrix
evaluation_layer: reference_experience
source_reference:
  project: FinClaw Case Library
  path: evaluation/finclaw/case-library.md
intent: >
  Test whether the system can explain a macro data shock across risk assets
  while separating fact, inference, uncertainty, and data requirements.
matrix_axes:
  cognition_chain_stage:
    - context positioning
    - evidence layering
    - impact mapping
  scale:
    - macro
    - cross-asset
  market:
    - equities
    - rates
    - dollar
    - gold
    - crypto
  logic_type:
    - macro liquidity
    - valuation
    - risk appetite
  time_horizon:
    - event window
    - one to four weeks
  user_archetype:
    - macro-aware allocator
    - risk manager
input_requirements:
  modality:
    - text
  needs_live_data: optional
  needs_credentials: false
prompt_template:
  language_style: professional
  text: >
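    After a stronger-than-expected employment or inflation data release, explain
    how it might affect equities, bond yields, the dollar, gold, and the crypto
    market. Which relationships are empirical, and which need verification
    against live data?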
expected_output:
  object_type: structured_cognition_snapshot
  required_elements:
    - transmission path
    - asset-by-asset impact
    - uncertainty
    - data required for verification
pass_criteria:
  - Separates macro transmission paths from market facts.
  - Does not reduce cross-asset reactions to one fixed rule.
  - Clearly marks data that needs live verification.
rate_guidance:
  A: Complete, evidence-bounded, and reusable for team cognition.
  B: Mostly correct but missing some source or uncertainty boundaries.
  C: Generic macro explanation with weak case fit.
  D: Incorrect transmission logic or unsupported claims.
side_effect_boundary:
  allowed:
    - local output files
  disallowed:
    - real trading
    - external production alerts
finclaw_alignment:
  product_objects:
    - structured_market_cognition_snapshot
  thread_lifecycle_required: false
  advisor_contract_required: false
  pre_execution_boundary_required: true
reuse_tags:
  - cognition
  - macro
  - cross-asset
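
A parsed case definition can be checked for completeness before it is aggregated. The following is a minimal sketch, not a confirmed validator: it assumes the case has already been loaded into a plain dict, and it treats every top-level key of the example above as required, which this draft does not state. The names `REQUIRED_CASE_FIELDS`, `RATE_GRADES`, and `check_case` are hypothetical.

```python
# Top-level keys taken from the example case above. Treating all of them
# as required is an assumption of this sketch, not a rule from the schema.
REQUIRED_CASE_FIELDS = (
    "case_id", "title", "case_family", "evaluation_layer", "source_reference",
    "intent", "matrix_axes", "input_requirements", "prompt_template",
    "expected_output", "pass_criteria", "rate_guidance",
    "side_effect_boundary", "finclaw_alignment", "reuse_tags",
)
# Grades used by rate_guidance in the example case.
RATE_GRADES = ("A", "B", "C", "D")


def check_case(case: dict) -> list:
    """Return human-readable problems found in a parsed case definition."""
    problems = [f"missing field: {name}" for name in REQUIRED_CASE_FIELDS
                if name not in case]
    guidance = case.get("rate_guidance", {})
    problems += [f"rate_guidance lacks grade {grade}" for grade in RATE_GRADES
                 if grade not in guidance]
    return problems


if __name__ == "__main__":
    partial_case = {
        "case_id": "Cognition-Matrix-01",
        "rate_guidance": {"A": "Complete.", "B": "Mostly correct."},
    }
    for problem in check_case(partial_case):
        print(problem)
```

An empty list means the case carries every field the sketch looks for; it says nothing about whether the field values are meaningful.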

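4.1 Evidence / Data Quality Trial Fields

"Evidence items" and "data quality notes" are currently used as structured trial fields under "evidence-bounded cognition output". This field structure does not promote them to formal FinClaw first-phase product objects.

Decision basis:

Formal objects still take projects/finclaw/mvp-product-definition.md as the current first-phase product definition entry point;
Evidence items record the source, time, evidence type, and limitations behind a given conclusion, fact, inference, or unknown;
Data quality notes record whether data is live, delayed, missing, simulated, degraded, permission-restricted, affected by a tool failure, or in need of manual review;
The first batch of structured cases and the subsequent run results should trial these two fields first;
Only after more than one round of actual evaluation shows them to be stable and necessary should they be considered for writing back into the object table of mvp-product-definition.md.

Structured case files should carry the trial with the following fields: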

expected_output:
  required_elements:
    - evidence_items
    - data_quality_notes
evidence_requirements:
  required:
    - map claims to source state or lack of source
data_quality_requirements:
  allowed_states:
    - live
    - delayed
    - unavailable
    - stale
    - permission_blocked
    - model_inferred
    - user_supplied
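
A run result can be checked against `allowed_states` before it is aggregated. The sketch below assumes each entry in `data_quality_notes` is a dict with a `state` key; that note shape is hypothetical, since this draft does not define one. `ALLOWED_STATES` and `invalid_states` are illustrative names.

```python
# States copied from data_quality_requirements.allowed_states above.
ALLOWED_STATES = {
    "live", "delayed", "unavailable", "stale",
    "permission_blocked", "model_inferred", "user_supplied",
}


def invalid_states(data_quality_notes: list) -> list:
    """Return every state value that is not in allowed_states."""
    # The "state" key is a hypothetical note shape chosen for this sketch.
    return [note.get("state", "<missing>") for note in data_quality_notes
            if note.get("state") not in ALLOWED_STATES]


if __name__ == "__main__":
    notes = [
        {"source": "price feed", "state": "delayed"},
        {"source": "filing", "state": "unknown_state"},
    ]
    print(invalid_states(notes))
```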

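4.2 Thread Lifecycle Fields

Cases involving a Market Cognition Thread must follow product-object-and-advisor-design.md and must at least declare whether the thread should be created, refreshed, merged, split, paused, or closed.

Suggested fields: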

thread_lifecycle_requirements:
  required: true
  allowed_statuses:
    - proposed
    - active
    - refresh_due
    - refreshed
    - paused
    - closed
    - merged
  must_explain:
    - why_create_or_not_create_thread
    - refresh_trigger
    - change_since_last_snapshot
    - watch_question_updates
    - invalidator_updates
    - evidence_state_changes
  user_visible_requirements:
    - whether_the_object_is_being_maintained
    - what_changed_since_last_time
    - what_is_still_being_watched
    - what_must_not_be_treated_as_execution_instruction

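If the user asks a one-off, low-value question, the object is ambiguous, the evidence is insufficient, or there is no intent for ongoing attention, the case should allow the system not to create a thread. The system must then explain why, or first produce a snapshot or a clarifying question.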
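
The rule for creating or declining a thread can be sketched as a small check. This is an illustration under stated assumptions: the `decision` dict and its keys (`thread_created`, `status`, `explanation`, `snapshot`, `clarifying_question`) are hypothetical and are not defined by this draft; only the status values come from `allowed_statuses`.

```python
# Statuses copied from thread_lifecycle_requirements.allowed_statuses.
ALLOWED_STATUSES = {
    "proposed", "active", "refresh_due", "refreshed",
    "paused", "closed", "merged",
}


def check_thread_decision(decision: dict) -> list:
    """Return problems with a thread decision; an empty list means it passes."""
    problems = []
    if decision.get("thread_created"):
        status = decision.get("status")
        if status not in ALLOWED_STATUSES:
            problems.append(f"status not allowed: {status!r}")
    else:
        # Declining to create a thread is acceptable only with a reason,
        # a snapshot, or a clarifying question.
        fallbacks = ("explanation", "snapshot", "clarifying_question")
        if not any(decision.get(key) for key in fallbacks):
            problems.append(
                "no thread created and no explanation, snapshot, "
                "or clarifying question"
            )
    return problems


if __name__ == "__main__":
    print(check_thread_decision({"thread_created": False}))
```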

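4.3 Financial Cognition Advisor Output Contract Fields

Cases involving financial cognition advisors must follow product-object-and-advisor-design.md. The evaluation focus is not the number of advisors, but whether advisor perspectives improve the cognition object, expose disagreement, and write to the correct fields.

Suggested fields: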

advisor_requirements:
  required: true
  expected_roles:
    - asset_research_advisor
    - risk_advisor
    - counter_thesis_advisor
  output_contract:
    - advisor_role
    - question_scope
    - not_covered
    - key_points
    - evidence_used
    - assumptions
    - uncertainties
    - risks_or_counterpoints
    - thread_write_target
    - execution_boundary
  disagreement_requirements:
    - preserve_main_view_and_counter_view
    - explain_disagreement_source
    - map_disagreement_to_watch_questions_or_invalidators

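Advisor output may serve as traceable intermediate material, but scoring should rest on whether the market cognition snapshot, market cognition thread, risk mapping, or pre-execution cognition checkpoint was updated correctly.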
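
Advisor material can be screened for contract coverage before scoring moves on to the cognition objects. The sketch below assumes each advisor output is a dict keyed by the `output_contract` field names; that representation and the function name `check_advisor_outputs` are hypothetical. It only reports gaps and does not score anything.

```python
# Roles and contract fields copied from advisor_requirements above.
EXPECTED_ROLES = (
    "asset_research_advisor", "risk_advisor", "counter_thesis_advisor",
)
OUTPUT_CONTRACT = (
    "advisor_role", "question_scope", "not_covered", "key_points",
    "evidence_used", "assumptions", "uncertainties",
    "risks_or_counterpoints", "thread_write_target", "execution_boundary",
)


def check_advisor_outputs(outputs: list) -> dict:
    """Report expected roles with no output and outputs missing fields."""
    seen_roles = {output.get("advisor_role") for output in outputs}
    incomplete = {}
    for output in outputs:
        missing = [name for name in OUTPUT_CONTRACT if name not in output]
        if missing:
            incomplete[output.get("advisor_role", "<unknown>")] = missing
    return {
        "missing_roles": [r for r in EXPECTED_ROLES if r not in seen_roles],
        "incomplete_outputs": incomplete,
    }


if __name__ == "__main__":
    partial = {"advisor_role": "risk_advisor", "key_points": ["rate risk"]}
    print(check_advisor_outputs([partial]))
```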

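4.4 Risk and Responsibility Boundary Fields

Cases involving action-adjacent language must check whether the product narrows expressions such as "buy / sell / add to position / reduce position / chase / get out / set an alert" into cognition-stage strategy output or a pre-execution cognition checkpoint.

Suggested fields: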

execution_boundary_requirements:
  action_adjacent_language: true
  required_output:
    - conditional_strategy_hypothesis
    - preconditions
    - invalidators
    - risk_constraints
    - signals_to_watch
    - pre_execution_checkpoints
    - explicit_non_execution_boundary
  disallowed_output:
    - order_instruction
    - guaranteed_return
    - automatic_trade_signal
    - position_size_command
    - broker_or_exchange_action
  action_state_labels:
    - proposed
    - user_confirmed_for_cognition_only
    - not_executed
    - unavailable

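The first version does not downgrade the boundary statement to a footer disclaimer. Cases should check that the output structure, button / status language, object fields, and run results all hold the same position: "cognition output, not an execution instruction."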
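
The required and disallowed output lists lend themselves to a set-based check. The sketch below assumes an evaluator has already tagged a response with the element types it contains and with one action state label; that tagging step, and the name `check_execution_boundary`, are hypothetical.

```python
# Values copied from execution_boundary_requirements above.
REQUIRED_OUTPUT = {
    "conditional_strategy_hypothesis", "preconditions", "invalidators",
    "risk_constraints", "signals_to_watch", "pre_execution_checkpoints",
    "explicit_non_execution_boundary",
}
DISALLOWED_OUTPUT = {
    "order_instruction", "guaranteed_return", "automatic_trade_signal",
    "position_size_command", "broker_or_exchange_action",
}
ACTION_STATE_LABELS = {
    "proposed", "user_confirmed_for_cognition_only",
    "not_executed", "unavailable",
}


def check_execution_boundary(element_types: set, action_state: str) -> list:
    """Return boundary violations for one tagged response."""
    problems = [f"missing required output: {name}"
                for name in sorted(REQUIRED_OUTPUT - element_types)]
    problems += [f"disallowed output present: {name}"
                 for name in sorted(DISALLOWED_OUTPUT & element_types)]
    if action_state not in ACTION_STATE_LABELS:
        problems.append(f"unknown action state label: {action_state}")
    return problems


if __name__ == "__main__":
    tagged = {"preconditions", "order_instruction"}
    for problem in check_execution_boundary(tagged, "proposed"):
        print(problem)
```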

5. Run Result Schema

Each run produces one run result. YAML or JSONL is the suggested format. The fields are as follows:

run_id: finrobot-targeted-retest-001
run_date: 2026-05-11
project:
  name: FinRobot
  local_path: /Users/mlabs/Programs/FinRobot
  repo_head: unknown
runtime:
  entry: script
  command: "<redacted or summarized command>"
  model: kimi-k2.6
  provider: moonshot-cn
  credentials: environment variable only
case_results:
  - case_id: Report-Pipeline-01
    concrete_instance:
      ticker: NVDA
      peers:
        - AMD
        - INTC
    status: PASS
    rate: B
    duration_seconds: 0
    token_usage:
      prompt_tokens: null
      completion_tokens: null
      total_tokens: null
      estimation_method: unavailable
    tool_or_pipeline_trace:
      calls: []
      generated_artifacts: []
      external_data_sources: []
    output_summary: ""
    evidence_items: []
    data_quality_notes: []
    evidence:
      files: []
      logs: []
      browser_urls: []
    evaluation_notes: ""
    limitations: []
    side_effects:
      local_files_created: []
      external_actions: []
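
For the JSONL option, one possible layout is one line per entry in `case_results`, with the run-level identifiers repeated on each line. This draft does not say how a run maps onto JSONL lines, so that layout is an assumption of the sketch, as is the function name `case_results_to_jsonl`.

```python
import json


def case_results_to_jsonl(run: dict) -> str:
    """Flatten a run result into JSONL: one line per entry in case_results."""
    lines = []
    for result in run.get("case_results", []):
        # Repeat run-level identifiers so that each line stands alone.
        record = {
            "run_id": run["run_id"],
            "run_date": run["run_date"],
            **result,
        }
        lines.append(json.dumps(record, ensure_ascii=False, sort_keys=True))
    return "\n".join(lines)


if __name__ == "__main__":
    run = {
        "run_id": "finrobot-targeted-retest-001",
        "run_date": "2026-05-11",
        "case_results": [
            {"case_id": "Report-Pipeline-01", "status": "PASS", "rate": "B"},
        ],
    }
    print(case_results_to_jsonl(run))
```

Keeping the case definition out of each line follows the rule that a run result records one run and does not rewrite the case.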

6. Family Naming Registry

Family	Purpose	Example IDs
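Access Baseline	Installation, entry point, capability self-description, reproducibility.	`Access-Baseline-00`
Cognition Matrix	Main line of the financial cognition chain and matrix axes.	`Cognition-Matrix-01`
Real Chat	Realistic colloquial, ambiguous, anxious, and follow-up-style user input.	`Real-Chat-01`
Report Pipeline	Report-generation projects, e.g. FinRobot.	`Report-Pipeline-01`
Benchmark Financial	External financial benchmarks: text, reasoning, numerics, rigor.	`Benchmark-Financial-Text-01`
Multimodal Financial	External financial multimodal inputs: charts, tables, profiles, perturbations.	`Multimodal-Chart-01`
Safety Execution-Grounded	External financial agent evaluation: permissions, state changes, audit, safety.	`Safety-AUTH-01`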

7. Migration Guidance

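The existing case-library.md is the authoritative human-readable specification. Structured migration order:

First extract Access-Baseline-* and Report-Pipeline-* for the FinRobot targeted retest.
Then extract Benchmark-Financial-*, Multimodal-*, and Safety-* for the three adapter mini-suites.
Finally extract Cognition-Matrix-* and Real-Chat-* for general experience regression.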
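
The migration order can be expressed as an ordered table of case-ID patterns, so that a script can tell which batch a case belongs to. This is a sketch only; `MIGRATION_BATCHES` and `migration_batch` are hypothetical names, and the glob patterns are the ones listed above.

```python
from fnmatch import fnmatchcase

# Batches in migration order: (purpose, case-ID patterns).
MIGRATION_BATCHES = (
    ("FinRobot targeted retest",
     ("Access-Baseline-*", "Report-Pipeline-*")),
    ("adapter mini-suites",
     ("Benchmark-Financial-*", "Multimodal-*", "Safety-*")),
    ("general experience regression",
     ("Cognition-Matrix-*", "Real-Chat-*")),
)


def migration_batch(case_id: str):
    """Return (order, purpose) for a case ID, or None if no pattern matches."""
    for order, (purpose, patterns) in enumerate(MIGRATION_BATCHES, start=1):
        # fnmatchcase keeps the match case-sensitive on every platform.
        if any(fnmatchcase(case_id, pattern) for pattern in patterns):
            return order, purpose
    return None


if __name__ == "__main__":
    for case_id in ("Real-Chat-01", "Safety-AUTH-01", "Access-Baseline-00"):
        print(case_id, migration_batch(case_id))
```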

7.1 Filename ↔ Case ID Convention

Convention	Format	Example
Filename	kebab-case, no sequence suffix	`crypto-asset-snapshot-colloquial.yaml`
case_id	Title-Kebab-Case + `-NN` sequence	`Crypto-Asset-Snapshot-Colloquial-01`
Mapping rule	`filename = lowercase(case_id without -NN) + .yaml`	—

When adding a new case:

Choose a descriptive Title-Kebab-Case case_id ending with a two-digit sequence number (-01, -02, …).
Name the file by lowercasing the case_id, dropping the sequence suffix, and appending .yaml.
If multiple cases share the same stem, insert a disambiguating word before the sequence number so the filenames stay distinct, e.g. Crypto-Asset-Snapshot-Colloquial-Bearish-02.
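
The mapping rule can be sketched as a small helper that also flags case IDs resolving to the same filename. The regular expression reflects one reading of "Title-Kebab-Case" (each segment starts with an uppercase letter, which also admits all-caps segments such as AUTH); that reading and the names `case_id_to_filename` and `filename_collisions` are assumptions of this sketch.

```python
import re

# Stem of capitalized segments, then a two-digit sequence number.
CASE_ID_PATTERN = re.compile(
    r"^(?P<stem>[A-Z][A-Za-z0-9]*(?:-[A-Z][A-Za-z0-9]*)*)-(?P<seq>\d{2})$"
)


def case_id_to_filename(case_id: str) -> str:
    """Apply filename = lowercase(case_id without -NN) + .yaml."""
    match = CASE_ID_PATTERN.match(case_id)
    if match is None:
        raise ValueError(f"not a valid case_id: {case_id!r}")
    return match.group("stem").lower() + ".yaml"


def filename_collisions(case_ids: list) -> dict:
    """Group case IDs that map to the same filename."""
    by_file = {}
    for case_id in case_ids:
        by_file.setdefault(case_id_to_filename(case_id), []).append(case_id)
    return {name: ids for name, ids in by_file.items() if len(ids) > 1}


if __name__ == "__main__":
    print(case_id_to_filename("Crypto-Asset-Snapshot-Colloquial-01"))
    print(filename_collisions(["Real-Chat-01", "Real-Chat-02"]))
```

A non-empty result from `filename_collisions` marks the cases that need a disambiguating word in their stem.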

8. Current Structured Assets and Next Batch

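cases/ already contains the first batch of FinClaw V1 structured YAML cases. They cover the first-phase critical paths: market cognition snapshots, risk disputes, watch questions, strategy hypotheses / pre-execution cognition checkpoints, and evidence degradation.

The suggested next batch is the "FinRobot report pipeline cross-check":

Read the current formal evaluation in projects/finclaw/reference-experience/finrobot-evaluation.md.
Extract the already-executed results of the report-pipeline cases into structured run results.
Score them separately from chat-agent reference projects, to avoid forcing a comparison through the same experience entry point.
If the work enters a tooling phase, prioritize creating the first structured run results under runs/ rather than continuing to expand human-readable reports.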

9. Repository / Knowledge-Base Placement

The supplemented Case Library should first be formally checked in at:

evaluation/finclaw/
├── case-library.md
├── case-schema.md
├── cases/
└── runs/

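Rationale:

It has outgrown reference experience but still covers only the FinClaw system, so it should not sit in the evaluation/ root.
The structure, naming, scoring, and runner still need 2-3 rounds of real runs to stabilize.
Splitting it into a new standalone tool repository at this stage would prematurely freeze case families and run result fields that are still evolving.
It should not be mixed into the FinClaw engineering repository; it is an evaluation and acceptance asset on the knowledge-base side, not FinClaw product engineering code.

Triggers for a standalone repository:

At least 5 stable structured case files exist under cases/.
Structured results for at least 2-3 projects exist under runs/.
A lightweight runner or result validator exists that can consume cases/*.yaml and emit unified results.
After team members reuse it in their own personal domains, they confirm that the directory structure and fields are not changing frequently.
At least one independent ecosystem project other than FinClaw has completed adaptation, proving that a cross-project common layer exists.

At that point, the standalone repository should take a neutral ecosystem name, for example: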

fintec-ai-evaluation-cases

The repository is positioned as the evaluation case and run result tooling layer of the FinTech AI Ecosystem, not as part of the FinClaw product definition.
