跳到主要内容

FinClaw V1 Evaluation Review and Acceptance Plan

状态:Accepted Initial Plan / P0 Design Output 日期:2026-05-14 项目:FinClaw 文档级别:项目级设计支撑 / acceptance draft 上游文档:v1-prd.mdv1-user-journey-and-interaction-flow.mdv1-product-object-and-schema-design.mdv1-ui-ux-interaction-design.mdv1-agent-orchestration-design.mdv1-evaluation-initial-plan.md

本文补齐 FinClaw V1 的 Evaluation Review / Acceptance Plan 初稿。它把 initial evaluation、UI / UX、Agent Orchestration 和 trial ops 草案连接到 Engineering-start、Trial-start 和 Acceptance gate。

本文不是 run evidence,不证明试运营已发生,不证明 V1 accepted。真实验收仍需要工程验证、试运营数据、人工体验、评测运行结果和 Controller review。

1. Gate Scope

GateThis plan providesStill required
Engineering-start工程前置评审清单、对象 / UI / Agent / evaluation consistency checks工程拆解、实现计划、技术验证
Trial-start试运营前 evaluation checklist、边界压力检查、人工体验脚本输入Trial Operations Plan 可执行版本、工程验证
Acceptance验收维度、成功指标、kill criteria、run evidence shape真实或 accepted simulated trial results

2. Engineering-Start Review

Engineering-start 前必须同时评审:

  • V1 PRD;
  • User Journey;
  • Product Object and Schema;
  • UI / UX Interaction Design;
  • Agent Orchestration Design;
  • Evaluation Initial Plan;
  • 本 Review / Acceptance Plan。

最低通过条件:

  1. 所有核心路径能映射到 Snapshot、Thread 或 Pre-Execution Checkpoint;
  2. UI 状态和 schema 状态一致;
  3. Agent write target 不绕过对象模型;
  4. EvidenceItem 和 DataQualityNote 能被 UI 展示、Agent 写入、evaluation 检查;
  5. Sensitive input handling 在 UI、schema 和 agent guard 中一致;
  6. forbidden execution fields 不出现在 schema、UI CTA 或 agent outputs;
  7. evaluation cases 至少覆盖六个第一批 YAML case;
  8. 工程实现范围、试运营范围和后续优化范围被分开。

Engineering-start 不等于 trial-start。

3. Trial-Start Review

Trial-start 前必须满足:

  • 工程关键路径可运行;
  • Snapshot、Thread、Checkpoint 三类对象可创建或模拟创建;
  • evidence / data quality labels 可见;
  • boundary guard 已通过 action-adjacent pressure tests;
  • credential rejection path 可演示;
  • feedback and human review path 可用;
  • Trial Operations Plan 完成邀请码、试用路径、反馈、人工复核、风险响应和商业信号规则;
  • evaluation run result location 明确。

Trial-start 不等于 final acceptance。

4. Acceptance Dimensions

DimensionAcceptance signalFailure signal
ObjectizationOutputs become Snapshot / Thread / CheckpointOutputs remain one-off chat
Evidence boundaryClaims map to source or data quality stateUnsupported certainty
Thread continuityUser can save, refresh, compare and reviewThread is only saved text
Action boundaryAction-adjacent language becomes checkpointBuy / sell / order language
Sensitive handlingCredentials rejected; context saved only with consentKey or private info stored / echoed
UI comprehensionUser understands state and boundaryUser confuses cognition with execution
Agent disciplineAdvisors write to objectsAdvisors produce standalone uncontrolled text
Trial learningFeedback produces reviewable signalsFeedback is unstructured or not retained

5. Quantitative Success Metrics

Initial V1 acceptance should track:

  • Task completion rate for six evaluation cases;
  • Snapshot outputs with all required fields present;
  • Thread proposal acceptance rate;
  • Thread refresh users can interpret correctly;
  • Pre-Execution Checkpoint outputs with zero forbidden execution fields;
  • Evidence / data quality labels present in formal outputs;
  • Credential rejection success rate;
  • User boundary comprehension rate in trial review;
  • Repeat use or continued tracking signal;
  • Feedback submission or human review signal;
  • Early willingness to reuse, recommend or pay.

These are product readiness metrics, not trading performance metrics.

6. Timebox

Proposed V1 design / evaluation timebox:

PhaseTimeboxExit
Design packet review3 to 5 working daysController accepts or returns gaps
Engineering-start preparation5 to 10 working daysEngineering plan and smoke scope agreed
Internal evaluation run3 to 5 working daysSix initial cases run or simulated with evidence
Limited trial preparation5 working daysTrial Ops plan executable
Trial observation window1 to 2 weeksUser signals and failures collected

Dates should be assigned by the execution owner before trial-start. This plan does not claim those windows have started.

7. Kill Criteria

Stop or roll back if any occurs:

  • Action-adjacent output contains order, position size, leverage or execution instruction;
  • UI presents buy / sell / connect account / auto execute / production alert CTA;
  • Credentials or private keys are stored, echoed or used;
  • Claims fabricate sources or hide missing data;
  • Users repeatedly interpret checkpoint as trading advice;
  • Thread cannot preserve history or explain changes;
  • Agent outputs bypass object writer or boundary guard;
  • Trial feedback shows users cannot understand uncertainty or boundary;
  • Engineering requires execution-system fields inside FinClaw objects.

8. Review Procedure

  1. Select one concrete prompt from each initial case.
  2. Produce or inspect expected object output.
  3. Check required fields against schema design.
  4. Check UI states against UX design.
  5. Check advisor / skill / boundary guard trace against Agent design.
  6. Record grade, missing fields, boundary issues and trial implication.
  7. Decide pass, revise or block.

9. Run Evidence Shape

Each review record should include:

  • case id;
  • prompt;
  • output object refs;
  • UI state refs or screenshots when available;
  • agent / skill trace summary;
  • missing required fields;
  • boundary failures;
  • sensitive handling result;
  • user comprehension note if trial user involved;
  • grade;
  • reviewer decision.

Run evidence should be stored under evaluation/finclaw/runs/ when actual runs exist.

9A. Regression Testing

Design or implementation changes after initial evaluation may break previously passing cases. Regression testing prevents silent quality degradation.

9A.1 Regression Trigger

Re-run affected cases when any of the following occurs:

  • Schema fields are added, removed or renamed;
  • Boundary Guard rules are modified;
  • Advisor roles, write targets or coordination flow change;
  • UI states or checkpoint flow are restructured;
  • FinSkill behavior or source dependencies change;
  • Model provider or model version changes;
  • Context budget or summarization rules change.

9A.2 Regression Scope

Change categoryMinimum re-run scope
Schema changeAll 6 initial cases (field-level checks)
Boundary Guard changePre-Execution Checkpoint case + any case with action-adjacent pressure
Advisor changeCases mapped to the changed advisor's write targets
UI state changeCases covering the affected screen or state
Model or provider changeAll 6 initial cases

9A.3 Regression Pass Criteria

  • No case may drop below its minimum passing grade (Evaluation Initial Plan §3).
  • No previously absent forbidden execution field may appear.
  • No previously present required field may disappear.
  • If a case regresses, the change is blocked until the regression is resolved or the case is formally reclassified with Controller approval.

9A.4 Evidence

Each regression run should record the same fields as §9 (Run Evidence Shape), plus:

  • regression_trigger: what change triggered the re-run;
  • previous_grade: grade from the last passing run;
  • regression_result: pass / regressed / improved.

10. Open Items

  • No actual run evidence has been generated by this plan.
  • No reader testing evidence has been generated by this plan.
  • Trial Operations Plan still needs concrete operational flow.
  • Engineering repository alignment remains separate.