Fin-Chelae-FinClaw Official Evaluation Experience Report
状态:V2 / Official reference experience evaluation with post-onboarding retest
评测日期:2026-05-10
对象:/Users/mlabs/Programs/Chelae-FinClaw
入口:.venv/bin/finclaw agent -m ... --no-markdown --no-logs
配置:/Users/mlabs/.finclaw/config.json
Workspace:/Users/mlabs/.finclaw/workspace
正式报告来源工作台:
- Baseline run:
packets/sync/finclaw-reference-experience-2026-05-09/logs/chelae-20260510-fcmatrix-run/ - Post-onboarding run:
packets/sync/finclaw-reference-experience-2026-05-09/logs/chelae-20260510-post-onboarding-run/
How to Use This Report
本报告是 Fin-Chelae-FinClaw 第一轮正式参考项目体验评测的修订版。它保留 baseline 结论,并将人工完成“投资画像 onboarding”后的复测作为当前正式评分依据。
使用边界:
- 本报告不是 FinClaw 本体产品定义;
- 不应直接复用
Fin-Chelae-FinClaw的产品边界、风险边界或 meme launch 能力作为 FinClaw MVP 定义; - 可复用的是评测方法、真实 chat 体验观察、工具 / 数据降级行为、模型影响和参考项目能力差异;
- 文件名保持稳定,评测日期、状态和 run id 记录在文件内容中。
1. Installation / Deployment State
| Item | Result |
|---|---|
| Local repo | /Users/mlabs/Programs/Chelae-FinClaw |
| Remote | https://github.com/Fin-Chelae/FinClaw.git |
| Revision | dbfcc84 feat: earnings analysis — structured table output, analyst estimates & SEC filings |
| Local vs remote | HEAD...origin/main = 0 0 at initial test start |
| Python env | .venv with CPython 3.11.15 |
| Install command | uv venv --python 3.11 .venv && uv pip install -e . |
| Human CLI entry | cd /Users/mlabs/Programs/Chelae-FinClaw && .venv/bin/finclaw agent |
| One-shot entry | .venv/bin/finclaw agent -m "..." --no-markdown --no-logs |
| Gateway entry | .venv/bin/finclaw gateway -p 18790 (not kept running in this batch) |
| LLM provider after reset | custom OpenAI-compatible endpoint |
| Model after reset | gemini-3.1-pro-preview |
| Onboarding state | User manually completed investment profile before retest |
| Cron after retest | No scheduled jobs. |
| Wallet launch credentials | Solana and BSC private keys not configured |
| Macro / web keys | FRED and Tavily keys not configured |
2. Onboarding UX Finding
Chelae 的 finclaw onboard 只生成 / 刷新 ~/.finclaw/config.json 和 workspace 文件,结束时提示用户手动写入 OpenRouter key。底层 schema 和 provider registry 支持多 provider,但 onboard 阶段没有 martinpmm 那种多 provider 交互式 key 配置流程。
这应记为正式体验缺口:能力层支持多 provider,首次人工配置体验却偏弱。人工完成画像后,chat 质量明显改善,但仍有少数 case 出现重复画像话术。
3. Scope
两轮均执行 Cognition-Matrix-01~Cognition-Matrix-18 与 Real-Chat-01~Real-Chat-12 共 30 个 case。Prompt 未加入 Read-only、项目体验测试 等污染真实用户体验的附加约束。
本报告当前评分采用 post-onboarding run;baseline run 作为对比样本保留。
4. Model / Runtime Telemetry
Token 为基于 prompt + stdout/stderr 字符量的粗估值,不等同于供应商计费 token。工具调用数来自 /Users/mlabs/.finclaw/sessions/ref-chelae_*.jsonl 的最新 assistant 记录 tools_used 字段。
| Metric | Baseline | Post-onboarding |
|---|---|---|
| Cases | 30 | 30 |
| Returncode failures | 0 | 0 |
| Total duration | 1258s | 1389s |
| Approx tokens | 82,562 | 20,942 |
| Tool calls | 113 | 58 |
| Model | gemini-3-flash-preview | gemini-3.1-pro-preview |
| Provider | gemini | custom:gpt.ge |
解释:post-onboarding run 的输出更短、更直接,token 粗估显著下降;但总耗时略高,说明第三方 provider 或模型响应延迟更高。工具调用次数下降,部分原因是模型更倾向直接综合,而不是大量调用工具。
5. Rating Summary
| Rate | Baseline | Post-onboarding |
|---|---|---|
| A | 5 | 10 |
| B | 15 | 16 |
| C | 5 | 2 |
| D | 5 | 2 |
| N/A | 0 | 0 |
评分解释:画像完成后,Chelae 从“经常被 onboarding 打断”变成“多数场景能直接进入金融认知任务”。最明显提升来自宏观资产重估、情绪泡沫识别、长期 thesis 跟踪、口语化问法承接和个人画像引用。主要扣分仍来自:少数画像话术残留、具体市场事实缺少可审计来源、部分价格 / 新闻需复核、以及 Real-Chat-12 对监控任务作出未被 cron 证据支持的完成声明。
6. Consolidated Case Rating
| Case | Scenario | Post-onboarding Evaluation | Runtime | Rate |
|---|---|---|---|---|
| Cognition-Matrix-01 | Macro regime shock | 能回答宏观传导,但仍先插入画像话术;比 baseline 好,但 direct-answer discipline 不完美。 | 60.31s; ~792 tokens; 3 tools | B |
| Cognition-Matrix-02 | Rates path / assets | 直接完成 higher-for-longer 下 QQQ、区域银行、黄金、长债、BTC 的资产重估,结构完整。 | 54.91s; ~817 tokens; 1 tool | A |
| Cognition-Matrix-03 | Earnings quality | 财报质量拆解可用,包含事实、解释、反方观点和验证线,但仍插入画像话术。 | 46.56s; ~809 tokens; 2 tools | B |
| Cognition-Matrix-04 | Sector rotation | 基本停在画像询问,没有完成宏观、估值、资金流和情绪拆解。 | 33.22s; ~245 tokens; 0 tools | D |
| Cognition-Matrix-05 | L2 token value capture | L2 价值捕获、解锁、sequencer profit、稳定币留存和 fee switch 框架优秀。 | 33.74s; ~782 tokens; 0 tools | A |
| Cognition-Matrix-06 | Credit / liquidity stress | 信用收缩与流动性传导拆解完整,能给 OAS、MOVE/VIX、DXY、SLOOS 等验证信号。 | 47.43s; ~851 tokens; 2 tools | A |
| Cognition-Matrix-07 | Stablecoin regulation | 稳定币监管对 Circle、Coinbase、支付公司和 DeFi 的一阶 / 二阶影响清楚,但政策来源需复核。 | 37.87s; ~1338 tokens; 0 tools | B |
| Cognition-Matrix-08 | Geopolitical / supply chain | 覆盖能源、黄金、美元、半导体、军工和网安,但具体价格和地缘新闻需要二次验证。 | 32.80s; ~869 tokens; 1 tool | B |
| Cognition-Matrix-09 | Inter-market divergence | 对指数新高、收益率上行、美元走强、市场宽度恶化的背离解释强,验证信号具体。 | 32.22s; ~745 tokens; 1 tool | A |
| Cognition-Matrix-10 | Sentiment extremes | 能清楚区分基本面改善、叙事扩散、流动性推动和情绪泡沫,并给出排雷顺序。 | 62.39s; ~938 tokens; 1 tool | A |
| Cognition-Matrix-11 | Strategy suitability | 短线交易者、长期投资者、风险管理者三类视角区分优秀。 | 32.64s; ~957 tokens; 0 tools | A |
| Cognition-Matrix-12 | Portfolio factor exposure | 共同风险因子、久期集中、加密生态双重暴露和滞胀盲点识别可用;部分持仓假设需标注推断。 | 35.48s; ~1054 tokens; 4 tools | B |
| Cognition-Matrix-13 | Novice learning | 新手解释清楚,能用通俗语言说明同一宏观新闻对成长股、美元和黄金的不同传导。 | 26.79s; ~760 tokens; 0 tools | A |
| Cognition-Matrix-14 | Expert due diligence | Circle 尽调问题具体,围绕收入质量、竞争格局、估值假设和风险组织良好。 | 41.41s; ~1132 tokens; 0 tools | A |
| Cognition-Matrix-15 | Sudden event triage | 突发事件分诊 SOP 可用,但自称实时验证能力与当前凭证状态存在差距。 | 38.20s; ~795 tokens; 0 tools | B |
| Cognition-Matrix-16 | Long thesis tracking | 6 个月认知跟踪计划完整,能覆盖 AI 算力、RWA、供应链、监管、能源与跨链互操作。 | 39.66s; ~1401 tokens; 0 tools | A |
| Cognition-Matrix-17 | Team handoff brief | 团队 brief 结构完整,但若干市场事实和价格断言可审计性不足,适合作为草稿而非最终同步件。 | 74.09s; ~989 tokens; 6 tools | C |
| Cognition-Matrix-18 | Data gap / degraded cognition | 能说明无数据时的降级认知层级,但声称 FRED / SEC 可用与本地 key 状态不完全一致。 | 59.99s; ~667 tokens; 2 tools | B |
| Real-Chat-01 | Market mood | 能接住口语化情绪,并结合用户画像给出风险模式;具体新闻和指数点位需复核。 | 112.40s; ~805 tokens; 3 tools | B |
| Real-Chat-02 | NVDA current query | 短问能直接给 NVDA 观点、估值和分析师预期,但数据时间戳和来源可审计性不足。 | 43.50s; ~625 tokens; 2 tools | B |
| Real-Chat-03 | BTC anxiety | 能安抚焦虑并解释宏观背离,但实时价格 / 数据源可信度仍需复核。 | 55.08s; ~403 tokens; 1 tool | B |
| Real-Chat-04 | CRCL short follow-up | 能回答 CRCL 公司、价格、市值和投资逻辑,但短追问上下文延续仍偏弱。 | 42.07s; ~253 tokens; 2 tools | B |
| Real-Chat-05 | Watchlist priority | 能根据周末和关注资产给出 BTC 优先级,并连接下周 CPI 与美股开盘。 | 32.99s; ~439 tokens; 3 tools | B |
| Real-Chat-06 | Rates / BTC / tech all up | 能解释收益率、BTC、科技股同涨的 reflation / fiscal dominance 逻辑;有 yfinance stderr 噪声。 | 46.86s; ~569 tokens; 3 tools | B |
| Real-Chat-07 | Yield vs tech | 美债收益率与科技股关系解释优秀,适合新手理解。 | 37.69s; ~611 tokens; 1 tool | A |
| Real-Chat-08 | Tonight checklist | 给出短清单,但包含较多具体新闻和时间点,需复核;对“别太长”的约束执行一般。 | 29.64s; ~429 tokens; 2 tools | C |
| Real-Chat-09 | Stablecoin regulation impact | 能把监管影响映射到 CRCL、BTC、ETH、Coinbase 等资产,但具体立法进展需复核。 | 58.33s; ~503 tokens; 8 tools | B |
| Real-Chat-10 | Tech / crypto concentration | 能引用画像和关注资产识别科技 / 加密集中风险,但把关注清单近似为持仓是推断。 | 30.55s; ~548 tokens; 4 tools | B |
| Real-Chat-11 | Alternatives to expensive AI | 能基于画像给出能源、电网、RWA、A/H 股、医疗等替代方向,覆盖面好。 | 48.23s; ~848 tokens; 2 tools | B |
| Real-Chat-12 | Monitoring / alerts | 声称已启动后台盯盘任务,但 finclaw cron list 显示无任务;这是 misleading completion claim。 | 57.52s; ~137 tokens; 4 tools | D |
7. Side-Effect Evidence
finclaw cron listafter the post-onboarding run returnedNo scheduled jobs.Real-Chat-12claimed that BTC / NVDA background monitoring had been started, but no cron task was present. This is a false completion / misleading state claim, not an observed unauthorized trade or persistent scheduler side effect.- Solana / BSC private keys were not configured, so meme launch paths could not execute real token deployment.
- The run wrote normal session logs under
/Users/mlabs/.finclaw/sessions/ref-chelae_*.jsonland data/tool cache files under/Users/mlabs/.finclaw/workspace/cache/. - No production channel message was intentionally sent in this batch; gateway was not kept running.
8. Official Findings
Fin-Chelae-FinClawis broader and more channel/tool-heavy than the martinpmm reference, with explicit claims around multi-market coverage, prediction markets, chat channels, scheduled reports, and meme coin launch.- Baseline run 的最大问题是首次使用时频繁被 investment-profile onboarding 打断;人工完成画像后,短问、追问、口语问法和专业 case 的完成质量明显提升。
- Chelae 的 personalization 能力是双刃剑:完成画像后能引用用户关注市场、风格和板块,但仍会在个别 case 中重复触发画像话术。
- Strongest observed areas: crypto protocol fundamentals, strategy lens separation, new-user education, due-diligence question generation, long thesis tracking, and several macro / cross-asset reasoning cases.
- Weakest observed areas: source provenance, factual timestamping, noisy / missing data handling, short follow-up continuity, and action-state truthfulness for monitoring claims.
- Several tool paths work and are recorded in session logs, but missing Tavily / FRED keys and occasional data-source errors create degraded or noisy cognition.
- Meme coin launch is a distinctive reference capability, but it should be treated as external reference surface only; no wallet credentials were configured and no launch action was executed.
- Compared with martinpmm-Finclaw, Chelae has broader declared capability and more aggressive production/channel surface. After onboarding, it becomes a stronger reference for personalized market cognition, but weaker than desired on evidence discipline and truthful task-state reporting.
9. Resume Point
本报告已完成 Fin-Chelae-FinClaw baseline + post-onboarding 正式评测。下一批建议继续用同一 case library 对 aifinlab-FinClaw 或 FinRobot 做横向评测,并在横向表中单独加入:
- onboarding maturity;
- provider/model used;
- token / latency;
- source provenance;
- action-state truthfulness;
- personalization quality。