Fin-Chelae-FinClaw Official Evaluation Experience Report

Status: V2 / Official reference experience evaluation with post-onboarding retest
Evaluation date: 2026-05-10
Target: /Users/mlabs/Programs/Chelae-FinClaw
Entry point: .venv/bin/finclaw agent -m ... --no-markdown --no-logs
Config: /Users/mlabs/.finclaw/config.json
Workspace: /Users/mlabs/.finclaw/workspace
Source workbench for the official report:

Baseline run: packets/sync/finclaw-reference-experience-2026-05-09/logs/chelae-20260510-fcmatrix-run/
Post-onboarding run: packets/sync/finclaw-reference-experience-2026-05-09/logs/chelae-20260510-post-onboarding-run/

How to Use This Report

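This report is the revised version of the first round of the formal reference-project experience evaluation of Fin-Chelae-FinClaw. It retains the baseline conclusions and uses the retest performed after the investment-profile onboarding was completed manually as the current official scoring basis.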

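Usage boundaries:

This report is not a product definition for FinClaw itself;
Fin-Chelae-FinClaw's product boundaries, risk boundaries, and meme launch capability should not be reused directly as the FinClaw MVP definition;
What can be reused: the evaluation method, observations of real chat experience, tool / data degradation behavior, model effects, and capability differences across reference projects;
The file name stays stable; evaluation date, status, and run id are recorded in the file content.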

1. Installation / Deployment State

Item	Result
Local repo	`/Users/mlabs/Programs/Chelae-FinClaw`
Remote	`https://github.com/Fin-Chelae/FinClaw.git`
Revision	`dbfcc84 feat: earnings analysis — structured table output, analyst estimates & SEC filings`
Local vs remote	`HEAD...origin/main = 0 0` at initial test start
Python env	`.venv` with CPython 3.11.15
Install command	`uv venv --python 3.11 .venv && uv pip install -e .`
Human CLI entry	`cd /Users/mlabs/Programs/Chelae-FinClaw && .venv/bin/finclaw agent`
One-shot entry	`.venv/bin/finclaw agent -m "..." --no-markdown --no-logs`
Gateway entry	`.venv/bin/finclaw gateway -p 18790` (not kept running in this batch)
LLM provider after reset	`custom` OpenAI-compatible endpoint
Model after reset	`gemini-3.1-pro-preview`
Onboarding state	User manually completed investment profile before retest
Cron after retest	`No scheduled jobs.`
Wallet launch credentials	Solana and BSC private keys not configured
Macro / web keys	FRED and Tavily keys not configured

2. Onboarding UX Finding

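Chelae's finclaw onboard only generates / refreshes ~/.finclaw/config.json and the workspace files, and ends by prompting the user to write in an OpenRouter key manually. The underlying schema and provider registry support multiple providers, but the onboard stage has no interactive multi-provider key configuration flow of the kind martinpmm offers.

This should be recorded as a formal experience gap: the capability layer supports multiple providers, yet the first-time manual configuration experience is weak. After the profile was completed manually, chat quality improved noticeably, but a few cases still repeated the profile prompt.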

3. Scope

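Both runs executed Cognition-Matrix-01 through Cognition-Matrix-18 and Real-Chat-01 through Real-Chat-12, 30 cases in total. Prompts did not add constraints such as "Read-only" or "project experience test" that would contaminate the real user experience.

Current scoring in this report uses the post-onboarding run; the baseline run is retained as a comparison sample.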
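As a minimal sketch of the case set described above, the 30 case ids can be enumerated and each prompt passed unmodified to the one-shot entry. The function names `case_ids` and `build_command` are hypothetical; only the id ranges and the CLI flags come from this report, and the prompt texts themselves are not reproduced here.

```python
def case_ids() -> list[str]:
    """Enumerate the 18 Cognition-Matrix and 12 Real-Chat case ids."""
    matrix = [f"Cognition-Matrix-{i:02d}" for i in range(1, 19)]
    chat = [f"Real-Chat-{i:02d}" for i in range(1, 13)]
    return matrix + chat


def build_command(prompt: str) -> list[str]:
    """Build the one-shot CLI invocation; the prompt is passed through as-is,
    with no extra constraint text appended (hypothetical helper)."""
    return [".venv/bin/finclaw", "agent", "-m", prompt,
            "--no-markdown", "--no-logs"]


if __name__ == "__main__":
    ids = case_ids()
    print(len(ids), ids[0], ids[-1])
```

Keeping the command builder free of any prompt decoration mirrors the rule that test prompts should look like real user input.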

4. Model / Runtime Telemetry

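Token counts are rough estimates based on the character count of prompt + stdout/stderr and are not equivalent to the provider's billed tokens. Tool-call counts come from the tools_used field of the latest assistant record in /Users/mlabs/.finclaw/sessions/ref-chelae_*.jsonl.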

Metric	Baseline	Post-onboarding
Cases	30	30
Returncode failures	0	0
Total duration	1258s	1389s
Approx tokens	82,562	20,942
Tool calls	113	58
Model	`gemini-3-flash-preview`	`gemini-3.1-pro-preview`
Provider	`gemini`	`custom:gpt.ge`
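The two telemetry columns can be approximated along the following lines. This is a sketch under stated assumptions, not the script used for this report: the characters-per-token ratio is a placeholder (the report does not state the ratio it used), and the session record layout (one JSON object per line with a `role` key and a `tools_used` list) is assumed from the field name mentioned above.

```python
import json

# Assumption: placeholder ratio; the report does not state the one it used.
CHARS_PER_TOKEN = 4


def approx_tokens(prompt: str, stdout: str, stderr: str,
                  chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Rough token estimate from character counts (not billed tokens)."""
    total_chars = len(prompt) + len(stdout) + len(stderr)
    return total_chars // chars_per_token


def latest_tools_used(jsonl_text: str) -> list[str]:
    """Return tools_used from the last assistant record in a session log.

    Assumed layout: one JSON object per line, with "role" and an optional
    "tools_used" list.
    """
    latest: list[str] = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("role") == "assistant":
            latest = list(record.get("tools_used") or [])
    return latest
```

Because only the latest assistant record is read, tool calls made in earlier turns of the same session would not be counted under this assumption.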

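Interpretation: the post-onboarding run's output is shorter and more direct, so the rough token estimate drops sharply; total duration is slightly higher, which suggests higher response latency from the third-party provider or the model. Tool calls also fall, partly because the model tends to synthesize directly rather than call tools heavily.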
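The size of these shifts can be worked out directly from the table. The sketch below uses only the figures reported above; the helper name `pct_change` is illustrative.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from baseline to post-onboarding, in percent."""
    return (after - before) / before * 100.0


# (baseline, post-onboarding) values copied from the telemetry table.
METRICS = {
    "total_duration_s": (1258, 1389),
    "approx_tokens": (82562, 20942),
    "tool_calls": (113, 58),
}

if __name__ == "__main__":
    for name, (before, after) in METRICS.items():
        print(f"{name}: {pct_change(before, after):+.1f}%")
    # total_duration_s: +10.4%
    # approx_tokens: -74.6%
    # tool_calls: -48.7%
```

Since the model and the provider both changed between runs, these deltas describe the two runs as a whole and cannot be attributed to onboarding alone.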

5. Rating Summary

Rate	Baseline	Post-onboarding
A	5	10
B	15	16
C	5	2
D	5	2
N/A	0	0

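Scoring notes: once the profile was complete, Chelae went from "frequently interrupted by onboarding" to "able to go straight into the financial cognition task in most scenarios." The clearest gains came from macro asset repricing, sentiment-bubble identification, long-term thesis tracking, handling of colloquial phrasing, and references to the personal profile. The remaining deductions come mainly from: residual profile prompts in a few cases, specific market facts lacking auditable sources, some prices / news needing re-verification, and Real-Chat-12 claiming completion of a monitoring task without supporting cron evidence.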

6. Consolidated Case Rating

Case	Scenario	Post-onboarding Evaluation	Runtime	Rate
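Cognition-Matrix-01	Macro regime shock	Answers the macro transmission question but still inserts the profile prompt first; better than baseline, but direct-answer discipline is imperfect.	60.31s; ~792 tokens; 3 tools	B
Cognition-Matrix-02	Rates path / assets	Directly completes the asset repricing for QQQ, regional banks, gold, long bonds, and BTC under higher-for-longer; structure is complete.	54.91s; ~817 tokens; 1 tool	A
Cognition-Matrix-03	Earnings quality	Earnings-quality breakdown is usable, covering facts, interpretation, counterarguments, and verification lines, but still inserts the profile prompt.	46.56s; ~809 tokens; 2 tools	B
Cognition-Matrix-04	Sector rotation	Essentially stalls at the profile question and does not complete the macro, valuation, fund-flow, and sentiment breakdown.	33.22s; ~245 tokens; 0 tools	D
Cognition-Matrix-05	L2 token value capture	Excellent framework covering L2 value capture, unlocks, sequencer profit, stablecoin retention, and fee switch.	33.74s; ~782 tokens; 0 tools	A
Cognition-Matrix-06	Credit / liquidity stress	Complete breakdown of credit contraction and liquidity transmission; gives verification signals such as OAS, MOVE/VIX, DXY, and SLOOS.	47.43s; ~851 tokens; 2 tools	A
Cognition-Matrix-07	Stablecoin regulation	First- and second-order effects of stablecoin regulation on Circle, Coinbase, payment companies, and DeFi are clear, but policy sources need re-verification.	37.87s; ~1338 tokens; 0 tools	B
Cognition-Matrix-08	Geopolitical / supply chain	Covers energy, gold, the dollar, semiconductors, defense, and cybersecurity, but specific prices and geopolitical news need secondary verification.	32.80s; ~869 tokens; 1 tool	B
Cognition-Matrix-09	Inter-market divergence	Strong explanation of the divergence among index new highs, rising yields, a stronger dollar, and deteriorating market breadth; verification signals are specific.	32.22s; ~745 tokens; 1 tool	A
Cognition-Matrix-10	Sentiment extremes	Clearly distinguishes fundamental improvement, narrative diffusion, liquidity-driven moves, and sentiment bubbles, and gives a risk-screening order.	62.39s; ~938 tokens; 1 tool	A
Cognition-Matrix-11	Strategy suitability	Excellent separation of the three lenses: short-term trader, long-term investor, and risk manager.	32.64s; ~957 tokens; 0 tools	A
Cognition-Matrix-12	Portfolio factor exposure	Identification of common risk factors, duration concentration, double exposure to the crypto ecosystem, and the stagflation blind spot is usable; some holdings assumptions should be labeled as inference.	35.48s; ~1054 tokens; 4 tools	B
Cognition-Matrix-13	Novice learning	Clear explanation for novices; uses plain language to show how the same macro news transmits differently to growth stocks, the dollar, and gold.	26.79s; ~760 tokens; 0 tools	A
Cognition-Matrix-14	Expert due diligence	Circle due-diligence questions are specific and well organized around revenue quality, competitive landscape, valuation assumptions, and risks.	41.41s; ~1132 tokens; 0 tools	A
Cognition-Matrix-15	Sudden event triage	The sudden-event triage SOP is usable, but the claimed real-time verification capability does not match the current credential state.	38.20s; ~795 tokens; 0 tools	B
Cognition-Matrix-16	Long thesis tracking	The 6-month cognition tracking plan is complete, covering AI compute, RWA, supply chain, regulation, energy, and cross-chain interoperability.	39.66s; ~1401 tokens; 0 tools	A
Cognition-Matrix-17	Team handoff brief	Team brief is structurally complete, but several market facts and price assertions are insufficiently auditable; suitable as a draft rather than a final sync document.	74.09s; ~989 tokens; 6 tools	C
Cognition-Matrix-18	Data gap / degraded cognition	Explains the degraded cognition tiers when data is unavailable, but the claim that FRED / SEC are available is not fully consistent with the local key state.	59.99s; ~667 tokens; 2 tools	B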
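Real-Chat-01	Market mood	Picks up the colloquial mood and gives a risk mode based on the user profile; specific news and index levels need re-verification.	112.40s; ~805 tokens; 3 tools	B
Real-Chat-02	NVDA current query	A short question gets a direct NVDA view, valuation, and analyst expectations, but data timestamps and source auditability are insufficient.	43.50s; ~625 tokens; 2 tools	B
Real-Chat-03	BTC anxiety	Calms the anxiety and explains the macro divergence, but the reliability of the real-time price / data source still needs re-verification.	55.08s; ~403 tokens; 1 tool	B
Real-Chat-04	CRCL short follow-up	Answers on CRCL's company, price, market cap, and investment logic, but context continuity on short follow-ups remains weak.	42.07s; ~253 tokens; 2 tools	B
Real-Chat-05	Watchlist priority	Prioritizes BTC based on the weekend timing and watched assets, and links it to next week's CPI and the US equity open.	32.99s; ~439 tokens; 3 tools	B
Real-Chat-06	Rates / BTC / tech all up	Explains the reflation / fiscal dominance logic behind yields, BTC, and tech stocks rising together; there is yfinance stderr noise.	46.86s; ~569 tokens; 3 tools	B
Real-Chat-07	Yield vs tech	Excellent explanation of the relationship between Treasury yields and tech stocks; suitable for novices.	37.69s; ~611 tokens; 1 tool	A
Real-Chat-08	Tonight checklist	Gives a short checklist, but it includes many specific news items and times that need re-verification; adherence to the "don't make it too long" constraint is mediocre.	29.64s; ~429 tokens; 2 tools	C
Real-Chat-09	Stablecoin regulation impact	Maps the regulatory impact to CRCL, BTC, ETH, Coinbase, and other assets, but specific legislative progress needs re-verification.	58.33s; ~503 tokens; 8 tools	B
Real-Chat-10	Tech / crypto concentration	Cites the profile and watched assets to identify tech / crypto concentration risk, but treating the watchlist as holdings is an inference.	30.55s; ~548 tokens; 4 tools	B
Real-Chat-11	Alternatives to expensive AI	Based on the profile, offers alternatives such as energy, power grid, RWA, A/H shares, and healthcare; good coverage.	48.23s; ~848 tokens; 2 tools	B
Real-Chat-12	Monitoring / alerts	Claims a background monitoring task was started, but `finclaw cron list` shows no jobs; this is a misleading completion claim.	57.52s; ~137 tokens; 4 tools	D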
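The post-onboarding column of the rating summary in section 5 can be cross-checked against the per-case grades in the table above. The sketch below hard-codes the grades in case order as transcribed from that table; the variable names are illustrative.

```python
from collections import Counter

# Grades transcribed from the case table, in case order.
COGNITION_MATRIX = "B A B D A A B B A A A B A A B A C B".split()  # 01..18
REAL_CHAT = "B B B B B B A C B B B D".split()                      # 01..12

# Post-onboarding column of the rating summary in section 5.
SUMMARY = {"A": 10, "B": 16, "C": 2, "D": 2}


def tally(grades: list[str]) -> dict[str, int]:
    """Count how many cases received each grade."""
    return dict(Counter(grades))


if __name__ == "__main__":
    counts = tally(COGNITION_MATRIX + REAL_CHAT)
    print(counts == SUMMARY, sum(counts.values()))
```

The tally reproduces the summary exactly (10 A, 16 B, 2 C, 2 D across 30 cases), so the two tables are consistent with each other.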

7. Side-Effect Evidence

finclaw cron list after the post-onboarding run returned No scheduled jobs.
Real-Chat-12 claimed that BTC / NVDA background monitoring had been started, but no cron task was present. This is a false completion / misleading state claim, not an observed unauthorized trade or persistent scheduler side effect.
Solana / BSC private keys were not configured, so meme launch paths could not execute real token deployment.
The run wrote normal session logs under /Users/mlabs/.finclaw/sessions/ref-chelae_*.jsonl and data/tool cache files under /Users/mlabs/.finclaw/workspace/cache/.
No production channel message was intentionally sent in this batch; gateway was not kept running.
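The action-state check behind the Real-Chat-12 finding can be sketched as follows. The `finclaw cron list` command and the `No scheduled jobs.` message are taken from this report; the output format when jobs do exist is not documented here, so the sketch treats any other non-empty output as possible evidence that still needs manual review. Function names are hypothetical.

```python
import subprocess

NO_JOBS_SENTINEL = "No scheduled jobs."  # message observed in this report


def read_cron_list(binary: str = ".venv/bin/finclaw") -> str:
    """Run `finclaw cron list` and return its stdout (requires the CLI)."""
    result = subprocess.run([binary, "cron", "list"],
                            capture_output=True, text=True, check=False)
    return result.stdout


def has_scheduled_jobs(cron_list_output: str) -> bool:
    """True only if the output is non-empty and is not the empty-list message.
    Assumption: the format of a non-empty job list is unknown."""
    text = cron_list_output.strip()
    return bool(text) and NO_JOBS_SENTINEL not in text


def check_monitoring_claim(claimed_started: bool, cron_list_output: str) -> str:
    """Classify an assistant's 'monitoring started' claim against cron state."""
    if not claimed_started:
        return "no-claim"
    if has_scheduled_jobs(cron_list_output):
        return "supported"
    return "unsupported-claim"
```

For the observed run, `check_monitoring_claim(True, "No scheduled jobs.")` yields `"unsupported-claim"`, which matches the D rating given to Real-Chat-12.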

8. Official Findings

Fin-Chelae-FinClaw is broader and more channel/tool-heavy than the martinpmm reference, with explicit claims around multi-market coverage, prediction markets, chat channels, scheduled reports, and meme coin launch.
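The biggest problem in the baseline run was frequent interruption by investment-profile onboarding on first use; after the profile was completed manually, completion quality on short questions, follow-ups, colloquial phrasing, and professional cases improved noticeably.
Chelae's personalization capability is a double-edged sword: once the profile is complete it can cite the user's watched markets, style, and sectors, but it still re-triggers the profile prompt in individual cases.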
Strongest observed areas: crypto protocol fundamentals, strategy lens separation, new-user education, due-diligence question generation, long thesis tracking, and several macro / cross-asset reasoning cases.
Weakest observed areas: source provenance, factual timestamping, noisy / missing data handling, short follow-up continuity, and action-state truthfulness for monitoring claims.
Several tool paths work and are recorded in session logs, but missing Tavily / FRED keys and occasional data-source errors create degraded or noisy cognition.
Meme coin launch is a distinctive reference capability, but it should be treated as external reference surface only; no wallet credentials were configured and no launch action was executed.
Compared with martinpmm-Finclaw, Chelae has broader declared capability and more aggressive production/channel surface. After onboarding, it becomes a stronger reference for personalized market cognition, but weaker than desired on evidence discipline and truthful task-state reporting.

9. Resume Point

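This report completes the formal baseline + post-onboarding evaluation of Fin-Chelae-FinClaw. For the next batch, the recommendation is to run the same case library against aifinlab-FinClaw or FinRobot for a cross-project comparison, and to add the following as separate columns in the comparison table:

onboarding maturity;
provider/model used;
token / latency;
source provenance;
action-state truthfulness;
personalization quality.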
