FinClaw V1 Evaluation Review and Acceptance Plan

状态：Accepted Initial Plan / P0 Design Output 日期：2026-05-14 项目：FinClaw 文档级别：项目级设计支撑 / acceptance draft 上游文档：v1-prd.md、v1-user-journey-and-interaction-flow.md、v1-product-object-and-schema-design.md、v1-ui-ux-interaction-design.md、v1-agent-orchestration-design.md、v1-evaluation-initial-plan.md

本文补齐 FinClaw V1 的 Evaluation Review / Acceptance Plan 初稿。它把 initial evaluation、UI / UX、Agent Orchestration 和 trial ops 草案连接到 Engineering-start、Trial-start 和 Acceptance gate。

本文不是 run evidence，不证明试运营已发生，不证明 V1 accepted。真实验收仍需要工程验证、试运营数据、人工体验、评测运行结果和 Controller review。

1. Gate Scope

Gate	This plan provides	Still required
Engineering-start	工程前置评审清单、对象 / UI / Agent / evaluation consistency checks	工程拆解、实现计划、技术验证
Trial-start	试运营前 evaluation checklist、边界压力检查、人工体验脚本输入	Trial Operations Plan 可执行版本、工程验证
Acceptance	验收维度、成功指标、kill criteria、run evidence shape	真实或 accepted simulated trial results

2. Engineering-Start Review

Engineering-start 前必须同时评审：

V1 PRD；
User Journey；
Product Object and Schema；
UI / UX Interaction Design；
Agent Orchestration Design；
Evaluation Initial Plan；
本 Review / Acceptance Plan。

最低通过条件：

所有核心路径能映射到 Snapshot、Thread 或 Pre-Execution Checkpoint；
UI 状态和 schema 状态一致；
Agent write target 不绕过对象模型；
EvidenceItem 和 DataQualityNote 能被 UI 展示、Agent 写入、evaluation 检查；
Sensitive input handling 在 UI、schema 和 agent guard 中一致；
forbidden execution fields 不出现在 schema、UI CTA 或 agent outputs；
evaluation cases 至少覆盖六个第一批 YAML case；
工程实现范围、试运营范围和后续优化范围被分开。

Engineering-start 不等于 trial-start。

3. Trial-Start Review

Trial-start 前必须满足：

工程关键路径可运行；
Snapshot、Thread、Checkpoint 三类对象可创建或模拟创建；
evidence / data quality labels 可见；
boundary guard 已通过 action-adjacent pressure tests；
credential rejection path 可演示；
feedback and human review path 可用；
Trial Operations Plan 完成邀请码、试用路径、反馈、人工复核、风险响应和商业信号规则；
evaluation run result location 明确。

Trial-start 不等于 final acceptance。

4. Acceptance Dimensions

Dimension	Acceptance signal	Failure signal
Objectization	Outputs become Snapshot / Thread / Checkpoint	Outputs remain one-off chat
Evidence boundary	Claims map to source or data quality state	Unsupported certainty
Thread continuity	User can save, refresh, compare and review	Thread is only saved text
Action boundary	Action-adjacent language becomes checkpoint	Buy / sell / order language
Sensitive handling	Credentials rejected; context saved only with consent	Key or private info stored / echoed
UI comprehension	User understands state and boundary	User confuses cognition with execution
Agent discipline	Advisors write to objects	Advisors produce standalone uncontrolled text
Trial learning	Feedback produces reviewable signals	Feedback is unstructured or not retained

5. Quantitative Success Metrics

Initial V1 acceptance should track:

Task completion rate for six evaluation cases;
Snapshot outputs with all required fields present;
Thread proposal acceptance rate;
Thread refresh users can interpret correctly;
Pre-Execution Checkpoint outputs with zero forbidden execution fields;
Evidence / data quality labels present in formal outputs;
Credential rejection success rate;
User boundary comprehension rate in trial review;
Repeat use or continued tracking signal;
Feedback submission or human review signal;
Early willingness to reuse, recommend or pay.

These are product readiness metrics, not trading performance metrics.

6. Timebox

Proposed V1 design / evaluation timebox:

Phase	Timebox	Exit
Design packet review	3 to 5 working days	Controller accepts or returns gaps
Engineering-start preparation	5 to 10 working days	Engineering plan and smoke scope agreed
Internal evaluation run	3 to 5 working days	Six initial cases run or simulated with evidence
Limited trial preparation	5 working days	Trial Ops plan executable
Trial observation window	1 to 2 weeks	User signals and failures collected

Dates should be assigned by the execution owner before trial-start. This plan does not claim those windows have started.

7. Kill Criteria

Stop or roll back if any occurs:

Action-adjacent output contains order, position size, leverage or execution instruction;
UI presents buy / sell / connect account / auto execute / production alert CTA;
Credentials or private keys are stored, echoed or used;
Claims fabricate sources or hide missing data;
Users repeatedly interpret checkpoint as trading advice;
Thread cannot preserve history or explain changes;
Agent outputs bypass object writer or boundary guard;
Trial feedback shows users cannot understand uncertainty or boundary;
Engineering requires execution-system fields inside FinClaw objects.

8. Review Procedure

Select one concrete prompt from each initial case.
Produce or inspect expected object output.
Check required fields against schema design.
Check UI states against UX design.
Check advisor / skill / boundary guard trace against Agent design.
Record grade, missing fields, boundary issues and trial implication.
Decide pass, revise or block.

9. Run Evidence Shape

Each review record should include:

case id;
prompt;
output object refs;
UI state refs or screenshots when available;
agent / skill trace summary;
missing required fields;
boundary failures;
sensitive handling result;
user comprehension note if trial user involved;
grade;
reviewer decision.

Run evidence should be stored under evaluation/finclaw/runs/ when actual runs exist.

9A. Regression Testing

Design or implementation changes after initial evaluation may break previously passing cases. Regression testing prevents silent quality degradation.

9A.1 Regression Trigger

Re-run affected cases when any of the following occurs:

Schema fields are added, removed or renamed;
Boundary Guard rules are modified;
Advisor roles, write targets or coordination flow change;
UI states or checkpoint flow are restructured;
FinSkill behavior or source dependencies change;
Model provider or model version changes;
Context budget or summarization rules change.

9A.2 Regression Scope

Change category	Minimum re-run scope
Schema change	All 6 initial cases (field-level checks)
Boundary Guard change	Pre-Execution Checkpoint case + any case with action-adjacent pressure
Advisor change	Cases mapped to the changed advisor's write targets
UI state change	Cases covering the affected screen or state
Model or provider change	All 6 initial cases

9A.3 Regression Pass Criteria

No case may drop below its minimum passing grade (Evaluation Initial Plan §3).
No previously absent forbidden execution field may appear.
No previously present required field may disappear.
If a case regresses, the change is blocked until the regression is resolved or the case is formally reclassified with Controller approval.

9A.4 Evidence

Each regression run should record the same fields as §9 (Run Evidence Shape), plus:

regression_trigger: what change triggered the re-run;
previous_grade: grade from the last passing run;
regression_result: pass / regressed / improved.

10. Open Items

No actual run evidence has been generated by this plan.
No reader testing evidence has been generated by this plan.
Trial Operations Plan still needs concrete operational flow.
Engineering repository alignment remains separate.

1. Gate Scope​

2. Engineering-Start Review​

3. Trial-Start Review​

4. Acceptance Dimensions​

5. Quantitative Success Metrics​

6. Timebox​

7. Kill Criteria​

8. Review Procedure​

9. Run Evidence Shape​

9A. Regression Testing​

9A.1 Regression Trigger​

9A.2 Regression Scope​

9A.3 Regression Pass Criteria​

9A.4 Evidence​

10. Open Items​