FinClaw V1 Evaluation Review and Acceptance Plan
Status: Accepted Initial Plan / P0 Design Output · Date: 2026-05-14 · Project: FinClaw · Document level: project-level design support / acceptance draft · Upstream documents: v1-prd.md, v1-user-journey-and-interaction-flow.md, v1-product-object-and-schema-design.md, v1-ui-ux-interaction-design.md, v1-agent-orchestration-design.md, v1-evaluation-initial-plan.md
This document completes the initial Evaluation Review / Acceptance Plan for FinClaw V1. It connects the initial evaluation, UI / UX, Agent Orchestration and trial-ops drafts to the Engineering-start, Trial-start and Acceptance gates.
This document is not run evidence. It does not prove that trial operations have happened, and it does not prove that V1 is accepted. Real acceptance still requires engineering verification, trial-operations data, human experience review, evaluation run results and Controller review.
1. Gate Scope
| Gate | This plan provides | Still required |
|---|---|---|
| Engineering-start | Pre-engineering review checklist; object / UI / agent / evaluation consistency checks | Engineering breakdown, implementation plan, technical verification |
| Trial-start | Pre-trial evaluation checklist, boundary pressure checks, human-experience script input | Executable Trial Operations Plan, engineering verification |
| Acceptance | Acceptance dimensions, success metrics, kill criteria, run evidence shape | Real or accepted simulated trial results |
2. Engineering-Start Review
The following must be reviewed together before Engineering-start:
- V1 PRD;
- User Journey;
- Product Object and Schema;
- UI / UX Interaction Design;
- Agent Orchestration Design;
- Evaluation Initial Plan;
- this Review / Acceptance Plan.
Minimum pass conditions:
- every core path can be mapped to a Snapshot, Thread or Pre-Execution Checkpoint;
- UI states and schema states are consistent;
- agent write targets do not bypass the object model;
- EvidenceItem and DataQualityNote can be displayed in the UI, written by agents and checked by evaluation;
- sensitive input handling is consistent across UI, schema and agent guards;
- forbidden execution fields do not appear in the schema, UI CTAs or agent outputs;
- evaluation cases cover at least the six first-batch YAML cases;
- engineering implementation scope, trial scope and follow-up optimization scope are kept separate.
Engineering-start is not the same as trial-start.
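The forbidden-execution-field condition above lends itself to an automated pre-review check. A minimal sketch, assuming object payloads are nested dicts; the field names in `FORBIDDEN_EXECUTION_FIELDS` are illustrative assumptions, not the canonical FinClaw schema:

```python
# Sketch: recursively scan an object payload for forbidden
# execution-related field names before Engineering-start review.
# The field names below are hypothetical examples.
FORBIDDEN_EXECUTION_FIELDS = {"order_size", "leverage",
                              "stop_loss_price", "execution_venue"}

def find_forbidden_fields(payload, path=""):
    """Return dotted paths of any forbidden fields in a nested payload."""
    hits = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            child = f"{path}.{key}" if path else key
            if key in FORBIDDEN_EXECUTION_FIELDS:
                hits.append(child)
            hits.extend(find_forbidden_fields(value, child))
    elif isinstance(payload, list):
        for i, item in enumerate(payload):
            hits.extend(find_forbidden_fields(item, f"{path}[{i}]"))
    return hits

snapshot = {"summary": "draft", "evidence": [{"source": "filing"}],
            "checkpoint": {"leverage": 3}}
print(find_forbidden_fields(snapshot))  # → ['checkpoint.leverage']
```

An empty result is necessary but not sufficient: the same scan should also run over UI CTA definitions and agent output templates, per the condition above.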
3. Trial-Start Review
The following must hold before Trial-start:
- the critical engineering path is runnable;
- Snapshot, Thread and Checkpoint objects can be created, or their creation simulated;
- evidence / data quality labels are visible;
- the boundary guard has passed action-adjacent pressure tests;
- the credential rejection path can be demonstrated;
- the feedback and human review path is usable;
- the Trial Operations Plan covers invite codes, trial paths, feedback, human review, risk response and commercial-signal rules;
- the location of evaluation run results is defined.
Trial-start is not the same as final acceptance.
4. Acceptance Dimensions
| Dimension | Acceptance signal | Failure signal |
|---|---|---|
| Objectization | Outputs become Snapshot / Thread / Checkpoint | Outputs remain one-off chat |
| Evidence boundary | Claims map to source or data quality state | Unsupported certainty |
| Thread continuity | User can save, refresh, compare and review | Thread is only saved text |
| Action boundary | Action-adjacent language becomes checkpoint | Buy / sell / order language |
| Sensitive handling | Credentials rejected; context saved only with consent | Key or private info stored / echoed |
| UI comprehension | User understands state and boundary | User confuses cognition with execution |
| Agent discipline | Advisors write to objects | Advisors produce standalone uncontrolled text |
| Trial learning | Feedback produces reviewable signals | Feedback is unstructured or not retained |
5. Quantitative Success Metrics
Initial V1 acceptance should track:
- Task completion rate for six evaluation cases;
- Snapshot outputs with all required fields present;
- Thread proposal acceptance rate;
- Share of Thread refreshes that users interpret correctly;
- Pre-Execution Checkpoint outputs with zero forbidden execution fields;
- Evidence / data quality labels present in formal outputs;
- Credential rejection success rate;
- User boundary comprehension rate in trial review;
- Repeat use or continued tracking signal;
- Feedback submission or human review signal;
- Early willingness to reuse, recommend or pay.
These are product readiness metrics, not trading performance metrics.
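Most of these rates can be derived mechanically from the review records defined in section 9. A sketch, assuming each record is a dict whose keys mirror the run-evidence shape (the exact key names and grade labels are assumptions):

```python
# Sketch: aggregate product-readiness metrics from review records.
# Record keys and grade labels are assumed, mirroring section 9.
def readiness_metrics(records):
    total = len(records)
    if total == 0:
        return {}
    passed = sum(1 for r in records if r["grade"] == "pass")
    complete = sum(1 for r in records if not r["missing_required_fields"])
    clean = sum(1 for r in records if not r["boundary_failures"])
    return {
        "task_completion_rate": passed / total,
        "required_fields_complete_rate": complete / total,
        "boundary_clean_rate": clean / total,
    }

records = [
    {"grade": "pass", "missing_required_fields": [], "boundary_failures": []},
    {"grade": "fail", "missing_required_fields": ["evidence"], "boundary_failures": []},
]
print(readiness_metrics(records))
# → {'task_completion_rate': 0.5, 'required_fields_complete_rate': 0.5,
#    'boundary_clean_rate': 1.0}
```

Comprehension, repeat-use and willingness signals still come from trial review notes and cannot be computed this way.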
6. Timebox
Proposed V1 design / evaluation timebox:
| Phase | Timebox | Exit |
|---|---|---|
| Design packet review | 3 to 5 working days | Controller accepts or returns gaps |
| Engineering-start preparation | 5 to 10 working days | Engineering plan and smoke scope agreed |
| Internal evaluation run | 3 to 5 working days | Six initial cases run or simulated with evidence |
| Limited trial preparation | 5 working days | Trial Ops plan executable |
| Trial observation window | 1 to 2 weeks | User signals and failures collected |
Dates should be assigned by the execution owner before trial-start. This plan does not claim those windows have started.
7. Kill Criteria
Stop or roll back if any occurs:
- Action-adjacent output contains order, position size, leverage or execution instruction;
- UI presents buy / sell / connect account / auto execute / production alert CTA;
- Credentials or private keys are stored, echoed or used;
- Claims fabricate sources or hide missing data;
- Users repeatedly interpret checkpoint as trading advice;
- Thread cannot preserve history or explain changes;
- Agent outputs bypass object writer or boundary guard;
- Trial feedback shows users cannot understand uncertainty or boundary;
- Engineering requires execution-system fields inside FinClaw objects.
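Several of these criteria are detectable with a cheap lexical scan before human review sees the output. A minimal sketch; the term list is an illustrative assumption, not an exhaustive policy, and human review remains the backstop:

```python
import re

# Sketch: flag action-adjacent language in a formal output.
# The pattern list below is a hypothetical starting point.
ACTION_TERMS = [r"\bbuy\b", r"\bsell\b", r"\border\b", r"\bleverage\b",
                r"\bposition size\b", r"auto[- ]execute"]

def action_adjacent_hits(text):
    """Return the patterns that match the (lowercased) output text."""
    lowered = text.lower()
    return [pat for pat in ACTION_TERMS if re.search(pat, lowered)]

hits = action_adjacent_hits("Consider the risk before you buy 3x leverage.")
print(hits)  # → ['\\bbuy\\b', '\\bleverage\\b']
```

Any non-empty hit list on a formal output would trigger the stop-or-roll-back decision above, pending reviewer confirmation.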
8. Review Procedure
- Select one concrete prompt from each initial case.
- Produce or inspect expected object output.
- Check required fields against schema design.
- Check UI states against UX design.
- Check advisor / skill / boundary guard trace against Agent design.
- Record grade, missing fields, boundary issues and trial implication.
- Decide pass, revise or block.
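The steps above read as a loop over cases. A minimal sketch of that loop; the check functions are hypothetical placeholders for the schema, UX and agent-design checks named above:

```python
# Sketch of the review loop in section 8. Each check is a named
# predicate over the produced output; names here are placeholders.
def review_case(case, produce_output, checks):
    output = produce_output(case["prompt"])
    issues = [name for name, check in checks if not check(output)]
    grade = "pass" if not issues else "revise"
    return {"case_id": case["id"], "issues": issues, "grade": grade}

checks = [
    ("required_fields", lambda o: "summary" in o),
    ("no_execution_cta", lambda o: "buy" not in o.get("cta", "")),
]
result = review_case({"id": "case-1", "prompt": "…"},
                     lambda p: {"summary": "draft", "cta": "save snapshot"},
                     checks)
print(result)  # → {'case_id': 'case-1', 'issues': [], 'grade': 'pass'}
```

The "block" decision stays with the human reviewer; the loop only surfaces issues for the record.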
9. Run Evidence Shape
Each review record should include:
- case id;
- prompt;
- output object refs;
- UI state refs or screenshots when available;
- agent / skill trace summary;
- missing required fields;
- boundary failures;
- sensitive handling result;
- user comprehension note if trial user involved;
- grade;
- reviewer decision.
Run evidence should be stored under evaluation/finclaw/runs/ when actual runs exist.
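Pinning the record shape down as a typed structure keeps reviews uniform across reviewers. A sketch using a Python dataclass; the field names follow the list above, but the types and defaults are assumptions:

```python
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class RunEvidence:
    """One review record, mirroring the fields listed in section 9."""
    case_id: str
    prompt: str
    output_object_refs: list = field(default_factory=list)
    ui_state_refs: list = field(default_factory=list)      # or screenshots
    agent_trace_summary: str = ""
    missing_required_fields: list = field(default_factory=list)
    boundary_failures: list = field(default_factory=list)
    sensitive_handling_result: str = "not_tested"
    user_comprehension_note: Optional[str] = None  # only if trial user involved
    grade: str = "ungraded"
    reviewer_decision: str = "pending"

record = RunEvidence(case_id="case-1", prompt="Compare two holdings")
print(asdict(record)["grade"])  # → ungraded
```

Serializing with `asdict` gives a plain dict suitable for storage under evaluation/finclaw/runs/ once actual runs exist.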
9A. Regression Testing
Design or implementation changes after initial evaluation may break previously passing cases. Regression testing prevents silent quality degradation.
9A.1 Regression Trigger
Re-run affected cases when any of the following occurs:
- Schema fields are added, removed or renamed;
- Boundary Guard rules are modified;
- Advisor roles, write targets or coordination flow change;
- UI states or checkpoint flow are restructured;
- FinSkill behavior or source dependencies change;
- Model provider or model version changes;
- Context budget or summarization rules change.
9A.2 Regression Scope
| Change category | Minimum re-run scope |
|---|---|
| Schema change | All 6 initial cases (field-level checks) |
| Boundary Guard change | Pre-Execution Checkpoint case + any case with action-adjacent pressure |
| Advisor change | Cases mapped to the changed advisor's write targets |
| UI state change | Cases covering the affected screen or state |
| Model or provider change | All 6 initial cases |
9A.3 Regression Pass Criteria
- No case may drop below its minimum passing grade (Evaluation Initial Plan §3).
- No previously absent forbidden execution field may appear.
- No previously present required field may disappear.
- If a case regresses, the change is blocked until the regression is resolved or the case is formally reclassified with Controller approval.
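The pass criteria above can be checked mechanically against the last passing run. A sketch; the grade labels, their ordering, and the minimum passing grade are assumptions standing in for the scale in the Evaluation Initial Plan §3:

```python
# Sketch: classify a regression run against the last passing run.
# Grade labels, ordering and the passing threshold are assumptions.
GRADE_ORDER = {"fail": 0, "weak_pass": 1, "pass": 2, "strong_pass": 3}
MIN_PASSING = "pass"

def regression_result(previous_grade, new_grade,
                      new_forbidden_fields, lost_required_fields):
    """Return 'pass', 'regressed' or 'improved' per the criteria in 9A.3."""
    if new_forbidden_fields or lost_required_fields:
        return "regressed"
    if GRADE_ORDER[new_grade] < GRADE_ORDER[MIN_PASSING]:
        return "regressed"
    if GRADE_ORDER[new_grade] > GRADE_ORDER[previous_grade]:
        return "improved"
    return "pass"

print(regression_result("pass", "strong_pass", [], []))  # → improved
print(regression_result("pass", "weak_pass", [], []))    # → regressed
```

A "regressed" result blocks the change until it is resolved or the case is formally reclassified with Controller approval, as stated above.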
9A.4 Evidence
Each regression run should record the same fields as §9 (Run Evidence Shape), plus:
- regression_trigger: what change triggered the re-run;
- previous_grade: grade from the last passing run;
- regression_result: pass / regressed / improved.
10. Open Items
- No actual run evidence has been generated by this plan.
- No reader testing evidence has been generated by this plan.
- Trial Operations Plan still needs concrete operational flow.
- Engineering repository alignment remains separate.