
Fin-Chelae-FinClaw Official Evaluation Experience Report

  • Status: V2 / Official reference experience evaluation with post-onboarding retest
  • Evaluation date: 2026-05-10
  • Target: /Users/mlabs/Programs/Chelae-FinClaw
  • Entry point: .venv/bin/finclaw agent -m ... --no-markdown --no-logs
  • Config: /Users/mlabs/.finclaw/config.json
  • Workspace: /Users/mlabs/.finclaw/workspace

Source workbenches for the official report:

  • Baseline run: packets/sync/finclaw-reference-experience-2026-05-09/logs/chelae-20260510-fcmatrix-run/
  • Post-onboarding run: packets/sync/finclaw-reference-experience-2026-05-09/logs/chelae-20260510-post-onboarding-run/

How to Use This Report

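This report is the revised version of the first official reference-project experience evaluation of Fin-Chelae-FinClaw. It retains the baseline conclusions and uses the retest performed after manual completion of the "investment profile onboarding" as the current basis for official scoring.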

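Usage boundaries:

  • This report is not a product definition for FinClaw itself;
  • Fin-Chelae-FinClaw's product boundaries, risk boundaries, and meme launch capability should not be reused directly as the FinClaw MVP definition;
  • What can be reused: the evaluation method, real chat experience observations, tool / data degradation behavior, model impact, and capability differences across reference projects;
  • The filename stays stable; the evaluation date, status, and run id are recorded in the file content.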

1. Installation / Deployment State

| Item | Result |
| --- | --- |
| Local repo | /Users/mlabs/Programs/Chelae-FinClaw |
| Remote | https://github.com/Fin-Chelae/FinClaw.git |
| Revision | dbfcc84 feat: earnings analysis — structured table output, analyst estimates & SEC filings |
| Local vs remote | HEAD...origin/main = 0 0 at initial test start |
| Python env | .venv with CPython 3.11.15 |
| Install command | uv venv --python 3.11 .venv && uv pip install -e . |
| Human CLI entry | cd /Users/mlabs/Programs/Chelae-FinClaw && .venv/bin/finclaw agent |
| One-shot entry | .venv/bin/finclaw agent -m "..." --no-markdown --no-logs |
| Gateway entry | .venv/bin/finclaw gateway -p 18790 (not kept running in this batch) |
| LLM provider after reset | custom OpenAI-compatible endpoint |
| Model after reset | gemini-3.1-pro-preview |
| Onboarding state | User manually completed investment profile before retest |
| Cron after retest | No scheduled jobs. |
| Wallet launch credentials | Solana and BSC private keys not configured |
| Macro / web keys | FRED and Tavily keys not configured |
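
For readers who want to reproduce a batch, the one-shot entry in the table can be wrapped in a small harness. The following is a minimal sketch, not the harness used for this report: the `CaseRun` record, the `run_case` helper, and the injectable `runner` parameter are hypothetical names introduced for illustration. Only the command line itself comes from the table.

```python
import subprocess
import time
from dataclasses import dataclass
from typing import Callable, Optional

FINCLAW_BIN = ".venv/bin/finclaw"  # from the "One-shot entry" row; relative to the repo root


@dataclass
class CaseRun:
    """Hypothetical per-case record: exit status, wall time, and captured output."""
    case_id: str
    returncode: int
    duration_s: float
    stdout: str
    stderr: str


def build_command(prompt: str) -> list[str]:
    """Build the one-shot invocation exactly as listed in the table."""
    return [FINCLAW_BIN, "agent", "-m", prompt, "--no-markdown", "--no-logs"]


def run_case(
    case_id: str,
    prompt: str,
    cwd: Optional[str] = None,
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> CaseRun:
    """Run one case and measure wall-clock duration.

    `runner` is injectable so the harness can be exercised without the CLI installed.
    """
    start = time.perf_counter()
    proc = runner(build_command(prompt), capture_output=True, text=True, cwd=cwd)
    duration = time.perf_counter() - start
    return CaseRun(case_id, proc.returncode, round(duration, 2), proc.stdout, proc.stderr)
```

A real run would pass the repo path as `cwd`, for example `run_case("Real-Chat-01", prompt, cwd="/Users/mlabs/Programs/Chelae-FinClaw")`, and send the prompt unmodified so that no extra constraints leak into the user experience being measured.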

2. Onboarding UX Finding

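Chelae's finclaw onboard only generates / refreshes ~/.finclaw/config.json and the workspace files, and ends by prompting the user to write in an OpenRouter key manually. The underlying schema and provider registry support multiple providers, but the onboard stage has no interactive multi-provider key configuration flow of the kind martinpmm offers.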

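This should be recorded as an official experience gap: the capability layer supports multiple providers, yet the first-time manual configuration experience is weak. After the profile was completed manually, chat quality improved noticeably, but a few cases still repeated profile boilerplate.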

3. Scope

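Both runs executed Cognition-Matrix-01 through Cognition-Matrix-18 and Real-Chat-01 through Real-Chat-12, 30 cases in total. Prompts did not include extra constraints such as "Read-only" or "project experience test" that would contaminate the real user experience.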

Current scoring in this report uses the post-onboarding run; the baseline run is kept as a comparison sample.

4. Model / Runtime Telemetry

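Token counts are rough estimates based on the character count of prompt + stdout/stderr and are not equivalent to the provider's billed tokens. Tool call counts come from the tools_used field of the latest assistant record in /Users/mlabs/.finclaw/sessions/ref-chelae_*.jsonl.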
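
The two measurements described above can be sketched as follows. This is an illustration under stated assumptions rather than the script used for the report: the 4-characters-per-token ratio is a guess (the report does not state its ratio), and the session record layout (a `role` key, with `tools_used` holding a list) is assumed beyond the `tools_used` field name given in the text.

```python
import json
from typing import Iterable

CHARS_PER_TOKEN = 4  # assumption: the report does not say which ratio it used


def approx_tokens(prompt: str, stdout: str, stderr: str = "") -> int:
    """Rough token estimate from character counts; not provider-billed tokens."""
    total_chars = len(prompt) + len(stdout) + len(stderr)
    return total_chars // CHARS_PER_TOKEN


def latest_tool_count(jsonl_lines: Iterable[str]) -> int:
    """Count tools in the last assistant record of a session log.

    Assumes each non-empty line is a JSON object with a "role" key and that
    "tools_used" is a list; both are guesses about the session schema.
    """
    latest = None
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("role") == "assistant":
            latest = record  # keep overwriting so the final assistant record wins
    if latest is None:
        return 0
    return len(latest.get("tools_used") or [])
```

Passing an open file handle for a session file as `jsonl_lines` works, since iterating a text file yields one line at a time.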

| Metric | Baseline | Post-onboarding |
| --- | --- | --- |
| Cases | 30 | 30 |
| Returncode failures | 0 | 0 |
| Total duration | 1258s | 1389s |
| Approx tokens | 82,562 | 20,942 |
| Tool calls | 113 | 58 |
| Model | gemini-3-flash-preview | gemini-3.1-pro-preview |
| Provider | gemini | custom:gpt.ge |

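Interpretation: the post-onboarding run produced shorter, more direct output, and the rough token estimate fell sharply (82,562 to 20,942, about 75% lower). Total duration was slightly higher (1389s vs 1258s, about 10%), which suggests higher response latency from the third-party provider or the model. Tool calls dropped from 113 to 58, partly because the model tended to synthesize directly rather than call tools heavily.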

5. Rating Summary

| Rate | Baseline | Post-onboarding |
| --- | --- | --- |
| A | 5 | 10 |
| B | 15 | 16 |
| C | 5 | 2 |
| D | 5 | 2 |
| N/A | 0 | 0 |

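Rating notes: once the profile was completed, Chelae went from "frequently interrupted by onboarding" to "able to go straight into the financial cognition task in most scenarios". The clearest gains came from macro asset repricing, sentiment-bubble identification, long thesis tracking, handling of colloquial questions, and references to the personal profile. Remaining deductions come mainly from residual profile boilerplate in a few cases, specific market facts lacking auditable sources, some prices / news needing verification, and Real-Chat-12 declaring a monitoring task complete without supporting cron evidence.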

6. Consolidated Case Rating

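| Case | Scenario | Post-onboarding Evaluation | Runtime | Rate |
| --- | --- | --- | --- | --- |
| Cognition-Matrix-01 | Macro regime shock | Answers the macro transmission question, but still opens with profile boilerplate; better than baseline, though direct-answer discipline is imperfect. | 60.31s; ~792 tokens; 3 tools | B |
| Cognition-Matrix-02 | Rates path / assets | Directly completes the repricing of QQQ, regional banks, gold, long-duration bonds, and BTC under higher-for-longer; well structured. | 54.91s; ~817 tokens; 1 tool | A |
| Cognition-Matrix-03 | Earnings quality | Earnings-quality breakdown is usable and covers facts, interpretation, counterarguments, and verification lines, but still inserts profile boilerplate. | 46.56s; ~809 tokens; 2 tools | B |
| Cognition-Matrix-04 | Sector rotation | Mostly stalls at profile questions and does not complete the macro, valuation, fund-flow, and sentiment breakdown. | 33.22s; ~245 tokens; 0 tools | D |
| Cognition-Matrix-05 | L2 token value capture | Excellent framework covering L2 value capture, unlocks, sequencer profit, stablecoin retention, and fee switch. | 33.74s; ~782 tokens; 0 tools | A |
| Cognition-Matrix-06 | Credit / liquidity stress | Complete breakdown of credit contraction and liquidity transmission, with verification signals such as OAS, MOVE/VIX, DXY, and SLOOS. | 47.43s; ~851 tokens; 2 tools | A |
| Cognition-Matrix-07 | Stablecoin regulation | Clear on the first- and second-order effects of stablecoin regulation on Circle, Coinbase, payment companies, and DeFi, but policy sources need verification. | 37.87s; ~1338 tokens; 0 tools | B |
| Cognition-Matrix-08 | Geopolitical / supply chain | Covers energy, gold, the dollar, semiconductors, defense, and cybersecurity, but specific prices and geopolitical news need secondary verification. | 32.80s; ~869 tokens; 1 tool | B |
| Cognition-Matrix-09 | Inter-market divergence | Strong explanation of the divergence among index highs, rising yields, a stronger dollar, and deteriorating market breadth; verification signals are specific. | 32.22s; ~745 tokens; 1 tool | A |
| Cognition-Matrix-10 | Sentiment extremes | Clearly separates fundamental improvement, narrative diffusion, liquidity-driven moves, and sentiment bubbles, and gives an order for ruling out risks. | 62.39s; ~938 tokens; 1 tool | A |
| Cognition-Matrix-11 | Strategy suitability | Excellent separation of the short-term trader, long-term investor, and risk manager perspectives. | 32.64s; ~957 tokens; 0 tools | A |
| Cognition-Matrix-12 | Portfolio factor exposure | Usable identification of common risk factors, duration concentration, double exposure to the crypto ecosystem, and a stagflation blind spot; some holdings assumptions should be labeled as inferences. | 35.48s; ~1054 tokens; 4 tools | B |
| Cognition-Matrix-13 | Novice learning | Clear explanation for novices; uses plain language to show how the same macro news transmits differently to growth stocks, the dollar, and gold. | 26.79s; ~760 tokens; 0 tools | A |
| Cognition-Matrix-14 | Expert due diligence | Circle due-diligence questions are specific and well organized around revenue quality, competitive landscape, valuation assumptions, and risks. | 41.41s; ~1132 tokens; 0 tools | A |
| Cognition-Matrix-15 | Sudden event triage | Sudden-event triage SOP is usable, but the claimed real-time verification capability does not match the current credential state. | 38.20s; ~795 tokens; 0 tools | B |
| Cognition-Matrix-16 | Long thesis tracking | The 6-month cognition tracking plan is complete and covers AI compute, RWA, supply chain, regulation, energy, and cross-chain interoperability. | 39.66s; ~1401 tokens; 0 tools | A |
| Cognition-Matrix-17 | Team handoff brief | Team brief is structurally complete, but several market facts and price assertions are not sufficiently auditable; suitable as a draft rather than a final sync document. | 74.09s; ~989 tokens; 6 tools | C |
| Cognition-Matrix-18 | Data gap / degraded cognition | Explains the tiers of degraded cognition when data is missing, but its claim that FRED / SEC are available is not fully consistent with the local key state. | 59.99s; ~667 tokens; 2 tools | B |
| Real-Chat-01 | Market mood | Picks up the colloquial mood and gives a risk mode based on the user profile; specific news and index levels need verification. | 112.40s; ~805 tokens; 3 tools | B |
| Real-Chat-02 | NVDA current query | A short question gets a direct NVDA view, valuation, and analyst expectations, but data timestamps and source auditability are insufficient. | 43.50s; ~625 tokens; 2 tools | B |
| Real-Chat-03 | BTC anxiety | Calms the anxiety and explains the macro divergence, but the reliability of real-time prices and data sources still needs verification. | 55.08s; ~403 tokens; 1 tool | B |
| Real-Chat-04 | CRCL short follow-up | Answers on CRCL's company, price, market cap, and investment logic, but context carry-over on short follow-ups is still weak. | 42.07s; ~253 tokens; 2 tools | B |
| Real-Chat-05 | Watchlist priority | Prioritizes BTC based on the weekend and the watched assets, and connects it to next week's CPI and the US equity open. | 32.99s; ~439 tokens; 3 tools | B |
| Real-Chat-06 | Rates / BTC / tech all up | Explains the reflation / fiscal dominance logic behind yields, BTC, and tech rising together; yfinance stderr noise is present. | 46.86s; ~569 tokens; 3 tools | B |
| Real-Chat-07 | Yield vs tech | Excellent explanation of the relationship between Treasury yields and tech stocks; suitable for novices. | 37.69s; ~611 tokens; 1 tool | A |
| Real-Chat-08 | Tonight checklist | Gives a short checklist, but it contains many specific news items and times that need verification; adherence to the "keep it short" constraint is mediocre. | 29.64s; ~429 tokens; 2 tools | C |
| Real-Chat-09 | Stablecoin regulation impact | Maps regulatory impact onto CRCL, BTC, ETH, Coinbase, and other assets, but specific legislative progress needs verification. | 58.33s; ~503 tokens; 8 tools | B |
| Real-Chat-10 | Tech / crypto concentration | Uses the profile and watched assets to identify tech / crypto concentration risk, but treating the watchlist as holdings is an inference. | 30.55s; ~548 tokens; 4 tools | B |
| Real-Chat-11 | Alternatives to expensive AI | Gives alternative directions based on the profile, including energy, power grid, RWA, A/H shares, and healthcare; good coverage. | 48.23s; ~848 tokens; 2 tools | B |
| Real-Chat-12 | Monitoring / alerts | Claims to have started a background monitoring task, but finclaw cron list shows no jobs; this is a misleading completion claim. | 57.52s; ~137 tokens; 4 tools | D |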
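
As a consistency check, the per-case rows above can be tallied and compared with the summary figures in sections 4 and 5. The sketch below transcribes only the rate and tool-call count from each row; the `CM` / `RC` abbreviations are shorthand used in this example only.

```python
from collections import Counter

# (case, rate, tool calls) transcribed from the post-onboarding rows above.
# "CM" = Cognition-Matrix, "RC" = Real-Chat (abbreviations local to this sketch).
CASES = [
    ("CM-01", "B", 3), ("CM-02", "A", 1), ("CM-03", "B", 2), ("CM-04", "D", 0),
    ("CM-05", "A", 0), ("CM-06", "A", 2), ("CM-07", "B", 0), ("CM-08", "B", 1),
    ("CM-09", "A", 1), ("CM-10", "A", 1), ("CM-11", "A", 0), ("CM-12", "B", 4),
    ("CM-13", "A", 0), ("CM-14", "A", 0), ("CM-15", "B", 0), ("CM-16", "A", 0),
    ("CM-17", "C", 6), ("CM-18", "B", 2),
    ("RC-01", "B", 3), ("RC-02", "B", 2), ("RC-03", "B", 1), ("RC-04", "B", 2),
    ("RC-05", "B", 3), ("RC-06", "B", 3), ("RC-07", "A", 1), ("RC-08", "C", 2),
    ("RC-09", "B", 8), ("RC-10", "B", 4), ("RC-11", "B", 2), ("RC-12", "D", 4),
]

rate_counts = Counter(rate for _, rate, _ in CASES)
total_tool_calls = sum(tools for _, _, tools in CASES)

print(len(CASES))                         # 30
print(dict(sorted(rate_counts.items())))  # {'A': 10, 'B': 16, 'C': 2, 'D': 2}
print(total_tool_calls)                   # 58
```

Both the rating distribution (10 / 16 / 2 / 2) and the tool-call total (58) agree with the post-onboarding columns of the summary tables.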

7. Side-Effect Evidence

  • finclaw cron list after the post-onboarding run returned No scheduled jobs.
  • Real-Chat-12 claimed that BTC / NVDA background monitoring had been started, but no cron task was present. This is a false completion / misleading state claim, not an observed unauthorized trade or persistent scheduler side effect.
  • Solana / BSC private keys were not configured, so meme launch paths could not execute real token deployment.
  • The run wrote normal session logs under /Users/mlabs/.finclaw/sessions/ref-chelae_*.jsonl and data/tool cache files under /Users/mlabs/.finclaw/workspace/cache/.
  • No production channel message was intentionally sent in this batch; gateway was not kept running.
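
The Real-Chat-12 finding suggests a simple automated check: compare what the agent says it did against the scheduler state afterwards. The following is a minimal sketch under stated assumptions. The `No scheduled jobs.` marker is the output observed in this report, but the hint phrases and the treatment of any other non-empty output as "jobs exist" are hypothetical, since the report does not show what `finclaw cron list` prints when jobs are present.

```python
NO_JOBS_MARKER = "No scheduled jobs."  # `finclaw cron list` output observed in this report

# Hypothetical phrases suggesting the agent claimed to have scheduled something.
# A real check would need a list tuned to the agent's actual wording.
CLAIM_HINTS = ("monitoring started", "started monitoring", "已启动", "后台盯盘")


def claims_scheduled_task(reply: str) -> bool:
    """True if the reply contains any of the (assumed) completion-claim phrases."""
    lowered = reply.lower()
    return any(hint.lower() in lowered for hint in CLAIM_HINTS)


def has_scheduled_jobs(cron_list_output: str) -> bool:
    """Assumption: non-empty output without the no-jobs marker means jobs exist."""
    stripped = cron_list_output.strip()
    return bool(stripped) and NO_JOBS_MARKER not in stripped


def is_unsupported_claim(reply: str, cron_list_output: str) -> bool:
    """Flag replies that claim a scheduled task when the scheduler shows none."""
    return claims_scheduled_task(reply) and not has_scheduled_jobs(cron_list_output)
```

Keeping the check as pure functions over captured text means it can run on stored logs after the batch, without re-invoking the CLI.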

8. Official Findings

  1. Fin-Chelae-FinClaw is broader and more channel/tool-heavy than the martinpmm reference, with explicit claims around multi-market coverage, prediction markets, chat channels, scheduled reports, and meme coin launch.
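  2. The biggest problem in the baseline run was frequent interruption by investment-profile onboarding on first use; after the profile was completed manually, completion quality improved noticeably for short questions, follow-ups, colloquial phrasing, and professional cases.
  3. Chelae's personalization is a double-edged sword: once the profile is complete it can reference the user's watched markets, style, and sectors, but it still re-triggers profile boilerplate in individual cases.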
  4. Strongest observed areas: crypto protocol fundamentals, strategy lens separation, new-user education, due-diligence question generation, long thesis tracking, and several macro / cross-asset reasoning cases.
  5. Weakest observed areas: source provenance, factual timestamping, noisy / missing data handling, short follow-up continuity, and action-state truthfulness for monitoring claims.
  6. Several tool paths work and are recorded in session logs, but missing Tavily / FRED keys and occasional data-source errors create degraded or noisy cognition.
  7. Meme coin launch is a distinctive reference capability, but it should be treated as external reference surface only; no wallet credentials were configured and no launch action was executed.
  8. Compared with martinpmm-Finclaw, Chelae has broader declared capability and more aggressive production/channel surface. After onboarding, it becomes a stronger reference for personalized market cognition, but weaker than desired on evidence discipline and truthful task-state reporting.

9. Resume Point

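This report completes the official baseline + post-onboarding evaluation of Fin-Chelae-FinClaw. For the next batch, we recommend running the same case library against aifinlab-FinClaw and FinRobot for a horizontal evaluation, with the following added as separate columns in the comparison table: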

  • onboarding maturity;
  • provider/model used;
  • token / latency;
  • source provenance;
  • action-state truthfulness;
  • personalization quality.
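
One possible shape for a row of that comparison table is sketched below. The `ComparisonRow` class and its field names are hypothetical; they simply mirror the six dimensions listed above, with token / latency split into two numeric fields. The example row is filled only with values and observations already stated in this report.

```python
from dataclasses import dataclass, asdict


@dataclass
class ComparisonRow:
    """Hypothetical record for one project in the horizontal comparison table."""
    project: str
    onboarding_maturity: str
    provider_model: str
    approx_tokens: int
    total_duration_s: float
    source_provenance: str
    action_state_truthfulness: str
    personalization_quality: str


# Example row using figures from the post-onboarding run in this report.
chelae = ComparisonRow(
    project="Fin-Chelae-FinClaw",
    onboarding_maturity="multi-provider schema, but onboard only prompts for an OpenRouter key",
    provider_model="custom:gpt.ge / gemini-3.1-pro-preview",
    approx_tokens=20942,
    total_duration_s=1389.0,
    source_provenance="listed among the weakest observed areas",
    action_state_truthfulness="Real-Chat-12 claimed monitoring with no cron job present",
    personalization_quality="references the profile; occasional repeated profile boilerplate",
)

row_dict = asdict(chelae)  # plain dict, ready for CSV or JSON export
```

Free-text fields are used for the qualitative dimensions so that each cell can carry the evidence behind the judgment rather than a bare score.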