FinClaw V1 Observability and Telemetry Design

Status: Accepted Initial Design / B-5 engineering blueprint (Wave 2)
Date: 2026-05-16
Project: FinClaw
Document level: project-level engineering blueprint (V1 observability and telemetry baseline)
Upstream documents: v1-tech-stack-and-architecture-design.md §6 / §8 (B-1), v1-data-and-persistence-design.md §3 / §7 (B-2), v1-api-contract-design.md §4.2.1 / §5 / §7 (B-3), v1-engineering-kickoff-decisions.md D-04 / D-12, v1-model-and-provider-policy.md §7
Companion documents: v1-commercial-signal-instrumentation-design.md (CS boundary), v1-cost-and-token-budget-design.md (B-7, aligned in both directions)

1. Purpose

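This document is the engineering-observability blueprint for the FinClaw V1 engineering repository /Users/mlabs/Programs/CurvatureLabs/finclaw/.

It answers four questions:

At the V1 internal-trial scale of 3 users, what is the minimum set of signals the engineering layer must collect to "see errors, see slowness, and see budget overruns"?
How are the three signal types (logs / metrics / traces) written to disk and inspected by hand under the V1 single-host docker-compose deployment?
Where is the boundary with v1-commercial-signal-instrumentation-design.md (product / commercial signals)?
How do we guarantee that credentials, private keys, mnemonics, and user PII never leak into telemetry?

This document does not rewrite the commercial-signal event catalog (CS = product / commercial signals; this document = engineering signals), does not introduce Datadog / Sentry / Grafana / Prometheus (D-04 / B-1 §8.4 lock the V1 stack without them), and does not elaborate the cost model itself (→ v1-cost-and-token-budget-design.md, B-7).

1.1 Boundary with CS instrumentation (important)

Dimension	Commercial Signal (CS / B-cs-*)	Engineering Telemetry (this document / B-5)
Question addressed	Is the user generating commercial value? Funnel / retention / willingness to pay	Is the service healthy? Are tasks slow? Over budget? Failing?
Triggered by	User behavior / product events	System / framework-internal events
Primary consumer	Trial closeout report, PM	Engineers / project initiator (operations view)
Data sensitivity	Gated by ProfileConsent; deleted within 48h of withdrawal	No PII by default; user ID / session ID are hashed before storage
Persistence path	`data/events/cs/...` (B-2 §4.1)	`data/events/system/...`, `data/traces/...`, `logs/...`, `data/metrics/...`
Withdrawal mechanism	Purged within 48h of consent withdrawal	Unaffected by consent withdrawal (contains no PII)

The two pipelines are physically isolated: write paths, SQLite indexes, and report scripts are all separate. CS events go through the commercial-signal submodule of event_sink.py; engineering telemetry goes through the server/obs/* submodules (see §4.3).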

2. Goals and Non-Goals

2.1 Goals

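Goal	Target
G-1	Cover every key boundary event of the 8 layers in v1-agent-orchestration-design.md §2 Layers and of the B-1 §6 Service Components with logs / metrics / traces
G-2	Let the project initiator answer within 1 minute: "In the past hour, were there failed tasks / budget overruns / boundary blocks / SSE lag?"
G-3	Align 1:1 with the B-2 §3 Trace persistence layer and the B-3 §5 SSE catalog: trace span names and SSE event names share one vocabulary
G-4	Provide the cost telemetry hook contract (§10) for v1-cost-and-token-budget-design.md (B-7), aligned in both directions
G-5	Depend on no external SaaS before the trial starts (Sentry / Datadog are excluded by the D-04 stack lock)
G-6	Every telemetry write satisfies the §12 privacy boundary: no sensitive_input_* plaintext, no LLM raw prompt / response, and user ID and session ID are hashed in traces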

2.2 Non-Goals

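Non-Goal	Rationale
Not integrating Sentry / Datadog / New Relic / Grafana	D-04 stack lock; for a 3-user V1 internal trial, grep + tail is enough
Not introducing Prometheus / a self-hosted metrics push gateway	Single process + file-based metrics + Python report scripts suffice; B-1 §8.4 excludes Prometheus by default
Not implementing automatic alerting / pager / webhook	Manual grep during the internal trial; re-evaluate before trial exit
Not writing LLM raw prompt / response into traces by default	Privacy (§12.3); optional on a local dev machine via the `--full-trace` flag
Not re-collecting CS signals (funnel / save / retention) in the engineering layer	Boundary (§1.1)
Not implementing distributed trace propagation (W3C TraceContext)	V1 traces stay within a single process; introduce it once there are multiple services (V2)
Not introducing an SLO / error budget framework	Threshold alerts suffice for V1; formal SLOs are deferred to V2

2.3 Alignment with existing decisions

D-04: this document introduces none of the SaaS excluded by the stack lock (Anthropic / Datadog / Sentry);
D-12: privacy / compliance review is less urgent but not waived → §12 lands first at the telemetry layer;
B-1 §6 Service Components: this document adds the server/obs/* submodules, which cut across the 8 component layers;
B-2 §2.4 Trace + §7.1 retention: traces are archived after 90 days, identical to the §6.5 retention in this document;
B-3 §5 SSE catalog: trace span names share the SSE event vocabulary (§6.3).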

3. Telemetry Categories

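V1 engineering telemetry falls into three categories, matching the three pillars of OpenTelemetry and industry-standard observability, but V1 implements them only as local files:

Category	V1 implementation	File location	Writer	Primary consumer
Logs	structlog → JSON line	`logs/<YYYY-MM-DD>.jsonl` (mirrored to stdout)	All components (`server/obs/logging.py` exposes `get_logger()`)	Engineers via grep / tail; error.log digest script
Metrics	Counters + histograms → JSONL flush + SQLite aggregation	`data/metrics/<YYYY-MM-DD>.jsonl` + `data/metrics/index.sqlite` (side index)	`server/obs/metrics.py` (in-memory counters / histograms, flushed periodically)	`make obs-summary` script; manual queries by the project initiator
Traces	Lightweight JSON spans → JSONL (one file per task)	`data/traces/<task_id>.jsonl` (consistent with B-2 §4.1)	`server/obs/trace.py` (context-managed spans, reusing the B-2 §2.4 Trace object)	Debugging / postmortem / eval reviewers

3.1 Boundaries between the three signal types

Signal	When to use	When not to use
Logs	Any text log, debug information, state change, security-related event, or uncaught exception	Not for metrics aggregation (grep + count does not scale)
Metrics	Numeric signals that need percentile / count / rate (route latency, token cost, SSE lag)	Not for error details (no stack trace)
Traces	The causal chain of a single task (ReAct loop / advisor calls / tool calls / disk writes)	Not for long-horizon aggregation (one file per task, never aggregated)

3.2 Hard stack constraints

Logging library: structlog (locked in B-1 §3); no loguru / logging-as-a-service;
Metrics: pure-Python in-house implementation (in-memory counter / histogram + flush coroutine); no prometheus_client / statsd;
Traces: plain JSONL; no OpenTelemetry SDK / Jaeger / Zipkin (to be evaluated for V2).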

4. Logs Schema

4.1 Unified log record fields

All log records go through get_logger(component: str) in server/obs/logging.py and are written to disk as JSON lines:

log_record:
  ts: "2026-05-18T10:23:04.123Z"            # ISO-8601 UTC
  level: "info" | "warn" | "error" | "debug"
  component: "agent.runtime" | "api.cognition_routes" | "agent.boundary_guard" | ...
  message: string                            # human-readable
  event: string                              # stable snake_case identifier, e.g. "task_started" / "boundary_block" / "provider_failover"
  trace_id: string | null                    # links to the §6 trace; all logs of one task share the same trace_id
  span_id: string | null                     # current span (if any)
  user_anon_id: string | null                # HMAC(user_id, salt); never the plaintext user_id
  request_id: string | null                  # HTTP request correlation
  data:                                      # structured payload (event-specific fields)
    <event-specific keys>
  schema_version: "1.0"

4.2 Hard field constraints

Field	Constraint
`ts`	ISO-8601 + UTC + `Z` suffix, consistent with B-2 §5.4
`component`	Aligned 1:1 with the module paths of the B-1 §6.1 Service Components: `api.` / `agent.` / `tools.` / `obs.`
`event`	snake_case; a new event must have a matching entry in the §5 metrics or the §6 trace spans
`trace_id`	Required when the log is produced in a task context; may be null for non-task logs (startup / shutdown)
`user_anon_id`	Never filled with the plaintext `user_id`; see §12.2
`data`	Must not contain the fields `prompt` / `response` / `credential` / `private_key` / `wallet_*` / `api_key` (CI lint, §12.4)
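As an illustration of the record shape and the `data` constraint above, here is a minimal sketch that builds one JSON line. It uses only the standard library rather than structlog so it runs anywhere; the helper name build_log_record and the small FORBIDDEN_DATA_KEYS set are assumptions made for this example, not the actual server/obs/logging.py API.

```python
import json
from datetime import datetime, timezone

# Hypothetical subset of the forbidden data keys listed in the table above.
FORBIDDEN_DATA_KEYS = {"prompt", "response", "credential", "private_key", "api_key"}


def build_log_record(component, event, message, level="info", data=None,
                     trace_id=None, span_id=None, user_anon_id=None, request_id=None):
    """Return one JSON line shaped like the log_record schema."""
    data = dict(data or {})
    bad = [k for k in data if k in FORBIDDEN_DATA_KEYS or k.startswith("wallet_")]
    if bad:
        raise ValueError(f"forbidden keys in data: {sorted(bad)}")
    now = datetime.now(timezone.utc)
    record = {
        # ISO-8601, UTC, millisecond precision, "Z" suffix.
        "ts": now.strftime("%Y-%m-%dT%H:%M:%S.") + f"{now.microsecond // 1000:03d}Z",
        "level": level,
        "component": component,
        "message": message,
        "event": event,
        "trace_id": trace_id,
        "span_id": span_id,
        "user_anon_id": user_anon_id,
        "request_id": request_id,
        "data": data,
        "schema_version": "1.0",
    }
    return json.dumps(record, ensure_ascii=False)
```

Rejecting forbidden keys at construction time, instead of silently dropping them, makes a violation visible in tests before it reaches disk; whether the real logger raises or redacts is a design choice this sketch does not settle.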

4.3 Key logger names (component field)

Component	Purpose
`api.cognition_routes`	Snapshot / Thread / Checkpoint endpoints
`api.consent_routes`	ProfileConsent endpoint
`api.sse`	SSE connection lifecycle
`api.errors`	4xx / 5xx error paths
`agent.runtime`	ReAct loop scheduling
`agent.task_router`	Classification into the 7 route types
`agent.context_engine`	System prompt construction
`agent.skill_manager`	Skill loading / selection
`agent.advisor_planner`	Advisor orchestration
`agent.boundary_guard`	Boundary interception events (highly sensitive)
`agent.sensitive_input`	Sensitive input classification (highly sensitive)
`agent.object_writer`	Object writes to disk
`agent.cognition_store`	Persistence IO
`agent.llm_client`	Provider call wrapper
`agent.provider_router`	Failover decisions
`agent.cost_telemetry`	Cost metering (cross-cutting, shared with §10)
`obs.metrics`	metrics flush
`obs.trace`	trace flush

4.4 Log outputs

Output	Purpose	Implementation
stdout JSON	Directly visible through docker-compose `docker logs`	structlog `JSONRenderer()`
`logs/<YYYY-MM-DD>.jsonl`	Daily archive, convenient for grep	structlog `WriteLoggerFactory`
`logs/error.log`	Copy of logs with `level >= error` only (§8 error capture)	filter handler

The logs/ files are not committed to git; .gitignore already covers them. Sanity check before trial start: verify write permission on the logs/ directory.

4.5 Log level semantics

Level	Meaning	Example triggers
`debug`	Enabled only in dev mode; off by default in trial-prod	system prompt length estimates, sub-call expansion
`info`	Normal events	`task_started`, `snapshot_completed`, `provider_call_ok`
`warn`	Recoverable anomaly / degradation	`provider_failover_triggered`, `degradation_notice_emitted`, `evidence_status: stale`
`error`	Unrecoverable / unexpected	`provider_failure_persistent`, `object_writer_atomic_write_failed`, `unhandled_exception`

5. Metrics Catalog

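5.1 Metric types

V1 implements only two metric types:

Type	Purpose	Implementation
`counter`	Cumulative count	int / float accumulation; each flush records the absolute value + delta
`histogram`	Latency / size distribution	Fixed buckets `[10, 50, 100, 250, 500, 1_000, 2_500, 5_000, 10_000, 30_000, 60_000, +Inf]` in milliseconds; P50 / P90 / P99 are computed by the flush coroutine

Gauges are not implemented (V1 has no continuous value that needs sample-by-sample tracking; for memory usage and the like, docker stats is enough).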
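To make the fixed-bucket histogram concrete, the following sketch shows one plausible way to record observations and estimate percentiles from bucket counts. The class name Histogram and the choice to report the upper bound of the bucket containing the quantile are assumptions for illustration; the document does not specify how the flush coroutine interpolates.

```python
import bisect
import math

# Bucket upper bounds in milliseconds, copied from the histogram row above.
BUCKETS_MS = [10, 50, 100, 250, 500, 1_000, 2_500, 5_000,
              10_000, 30_000, 60_000, math.inf]


class Histogram:
    def __init__(self):
        self.counts = [0] * len(BUCKETS_MS)
        self.total = 0
        self.sum = 0.0

    def observe(self, value_ms):
        # bisect_left returns the first bucket whose upper bound is >= value_ms.
        self.counts[bisect.bisect_left(BUCKETS_MS, value_ms)] += 1
        self.total += 1
        self.sum += value_ms

    def percentile(self, q):
        """Upper bound of the bucket that contains the q-quantile (0 < q <= 1)."""
        if self.total == 0:
            return None
        rank = math.ceil(q * self.total)
        seen = 0
        for bound, count in zip(BUCKETS_MS, self.counts):
            seen += count
            if seen >= rank:
                return bound
        return math.inf
```

With bucketed data the reported percentile is only as precise as the bucket edges, so a p99 of 1_000 means "somewhere in (500, 1_000] ms", which is usually sufficient for threshold checks of the kind listed in §5.4.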

5.2 Required metrics (by the 8 layers of v1-agent-orchestration-design.md §2 Layers)

5.2.1 API layer

Metric	Type	Labels	Trigger
`api_request_total`	counter	`route`, `method`, `status_code`	Each completed HTTP request
`api_request_duration_ms`	histogram	`route`, `method`, `status_code`	Each HTTP request
`api_4xx_total`	counter	`route`, `code` (B-3 §7.2)	4xx errors
`api_5xx_total`	counter	`route`	5xx errors
`api_sse_open_total`	counter	none	SSE connection opened
`api_sse_close_total`	counter	`reason: completed | client_disconnect | …`

5.2.2 Agent layer

Metric	Type	Labels	Trigger
`route_latency_ms`	histogram	`task_type`	Task Router finishes classification
`tool_call_duration_ms`	histogram	`tool_name`, `outcome: ok | error`
`tool_call_total`	counter	`tool_name`, `outcome`	Same as above
`skill_run_duration_ms`	histogram	`skill_id`, `outcome`	Skill completes (prompt + LLM + parse)
`advisor_invoked_total`	counter	`advisor_role`, `coordination_mode`	Advisor Planner
`cognition_step_count`	histogram	`task_type`	ReAct step count, recorded when each task completes
`boundary_guard_reject_total`	counter	`kind: forbidden_field | action_adjacent_redirect | …`
`sensitive_input_classified_total`	counter	`classification: ordinary_preference | financial_context | …`
`kill_switch_triggered_total`	counter	`reason`	kill_switch triggered
`cognition_store_write_duration_ms`	histogram	`object_type`	CognitionStore disk write
`cognition_store_read_duration_ms`	histogram	`object_type`	Same as above

5.2.3 LLM layer

Metric	Type	Labels	Trigger
`llm_call_total`	counter	`provider`, `model`, `outcome`	Each LLM call
`llm_call_duration_ms`	histogram	`provider`, `model`	Same as above
`llm_input_tokens`	histogram	`provider`, `model`, `skill_or_advisor_id`	Each LLM call
`llm_output_tokens`	histogram	`provider`, `model`, `skill_or_advisor_id`	Same as above
`llm_tool_tokens`	histogram	`provider`, `model`, `skill_or_advisor_id`	Same as above
`llm_cost_usd_total`	counter	`provider`, `model`, `skill_or_advisor_id`	Same as above (aligned in both directions with §10 / B-7)
`provider_failover_total`	counter	`from_provider`, `to_provider`, `reason`	Provider Router failover

5.2.4 Refresh / Thread layer

Metric	Type	Labels	Trigger
`thread_refresh_total`	counter	`trigger_type`	Thread refresh completed
`thread_refresh_duration_ms`	histogram	none	Same as above
`snapshot_generated_total`	counter	`ui_state`	Snapshot written to disk
`checkpoint_generated_total`	counter	none	Checkpoint written to disk

5.2.5 SSE layer

Metric	Type	Labels	Trigger
`sse_event_emitted_total`	counter	`event_name` (B-3 §5.2)	Each push
`sse_event_lag_ms`	histogram	`event_name`	Delay between event production and push (§9)
`sse_heartbeat_total`	counter	none	60s idle heartbeat

5.3 Flush strategy

in-memory ring: flushed every 30 seconds to data/metrics/<YYYY-MM-DD>.jsonl;
also aggregated into data/metrics/index.sqlite (side index), with this table schema:

CREATE TABLE metrics_aggregated_5min (
  bucket_ts        TIMESTAMP,           -- 5-minute bucket
  metric_name      TEXT,
  labels_json      TEXT,                -- canonical JSON with sorted keys
  count            INTEGER,
  sum              REAL,
  p50              REAL,
  p90              REAL,
  p99              REAL,
  PRIMARY KEY (bucket_ts, metric_name, labels_json)
);

On process exit / SIGTERM, the current ring is force-flushed (so the last 30 seconds are not lost);
Aligned with the limited SQLite use in B-2 §3.3: SQLite serves only as an aggregation index and can be rebuilt by replaying the JSONL.
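The side index can be sketched with the standard sqlite3 module. The example below assumes that bucket_ts is stored as the epoch second at which the 5-minute bucket starts and that a re-aggregated bucket simply overwrites the previous row, which suits a rebuild-by-replay model; the function names are hypothetical and not taken from server/obs/metrics.py.

```python
import json
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS metrics_aggregated_5min (
  bucket_ts TIMESTAMP, metric_name TEXT, labels_json TEXT,
  count INTEGER, sum REAL, p50 REAL, p90 REAL, p99 REAL,
  PRIMARY KEY (bucket_ts, metric_name, labels_json)
)
"""


def bucket_start(epoch_seconds):
    """Floor a timestamp to the start of its 5-minute (300 s) bucket."""
    return int(epoch_seconds) // 300 * 300


def canonical_labels(labels):
    """Sorted-key, whitespace-free JSON so equal label sets compare equal."""
    return json.dumps(labels, sort_keys=True, separators=(",", ":"))


def upsert(conn, epoch_seconds, metric_name, labels, count, total, p50, p90, p99):
    # INSERT OR REPLACE overwrites the row for an existing primary key,
    # so replaying the JSONL for a bucket is idempotent.
    conn.execute(
        "INSERT OR REPLACE INTO metrics_aggregated_5min VALUES (?,?,?,?,?,?,?,?)",
        (bucket_start(epoch_seconds), metric_name, canonical_labels(labels),
         count, total, p50, p90, p99),
    )
    conn.commit()
```

Canonicalizing the labels matters because labels_json is part of the primary key: `{"route": ..., "method": ...}` and `{"method": ..., "route": ...}` must land on the same row.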

5.4 Default alert thresholds (manual grep / `make obs-summary` output)

Signal	Threshold	Consequence
`api_5xx_total` over 1 hour	> 5	Project initiator is notified by email / Labs collaboration group (manual)
`boundary_guard_reject_total{kind=sensitive_input_reject}`	> 0	Review immediately (linked to RB-2 in v1-human-experience-trial-script.md)
`kill_switch_triggered_total`	≥ 1	Stop the trial immediately (D-06)
`sse_event_lag_ms p99`	> 5_000	warn → check whether `agent.runtime` is blocked
`llm_cost_usd_total{provider=*}` over 24 hours	Approaching the daily budget (→ B-7 §5)	warn → trigger degradation when approaching the hard cap
`provider_failover_total` over 1 hour	> 10	warn → check provider stability
`cognition_store_write_duration_ms p99`	> 1_000	warn → check disk / file locks

The thresholds are printed by server/scripts/obs_summary.py when make obs-summary runs, with colored ASCII output; during the trial the project initiator runs it manually once a day.

6. Trace Schema and Spans

6.1 Trace object

Reuses the B-2 §2.4 Trace and the B-2 §5.1 ID naming:

trace:
  trace_id: "tr_<task_id>"                # one trace per task; the ID maps 1:1 to task_id
  task_id: "task_<iso>_<hash>"
  task_type: "snapshot" | "thread_refresh" | "risk_challenge" | "pre_execution" | ...
  user_anon_id: string                    # §12.2 HMAC
  started_at: datetime
  ended_at: datetime | null
  status: "running" | "completed" | "failed" | "cancelled"
  spans:                                  # append-only span list; §6.2
    - <span>
  schema_version: "1.0"

Written to data/traces/<task_id>.jsonl (one span event per line); the trace header is on line 0 and spans are appended in time order.

6.2 Span Schema

span:
  span_id: "sp_<6-char-hash>"
  parent_span_id: string | null           # nesting (V1 is single-process; nothing propagates across services)
  name: string                            # see the §6.3 vocabulary
  started_at: datetime
  ended_at: datetime | null
  duration_ms: number | null
  status: "ok" | "error"
  attributes:                             # structured key-value pairs (no PII / prompt / response)
    <span-specific keys>
  events:                                 # sub-events (lightweight)
    - { ts, name, data }
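A context-managed span that appends one JSON line per span could look roughly like the sketch below. The function name span, the use of secrets.token_hex for the 6-character ID, and writing the record when the span closes are all assumptions for illustration; the actual server/obs/trace.py may differ.

```python
import json
import secrets
import time
from contextlib import contextmanager
from datetime import datetime, timezone


def _now_iso():
    ts = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    return ts.replace("+00:00", "Z")


@contextmanager
def span(trace_path, name, parent_span_id=None, **attributes):
    """Record one span and append it to the task's JSONL file on exit."""
    record = {
        "span_id": "sp_" + secrets.token_hex(3),  # 3 bytes -> 6 hex chars
        "parent_span_id": parent_span_id,
        "name": name,
        "started_at": _now_iso(),
        "ended_at": None,
        "duration_ms": None,
        "status": "ok",
        "attributes": attributes,
        "events": [],
    }
    t0 = time.monotonic()
    try:
        yield record
    except Exception:
        record["status"] = "error"
        raise
    finally:
        record["ended_at"] = _now_iso()
        record["duration_ms"] = round((time.monotonic() - t0) * 1000, 3)
        with open(trace_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A caller would wrap a unit of work as `with span(path, "tool_called", tool_name=..., outcome="ok") as sp:` and pass `sp["span_id"]` as parent_span_id to nested spans. Durations come from the monotonic clock so wall-clock adjustments do not produce negative values.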

6.3 Span name vocabulary (aligned 1:1 with the B-3 §5.2 SSE catalog)

Span name	SSE event	Trigger location	Key attributes
`task`	`task_started` / `task_completed` / `task_failed`	AgentRuntime top level	`task_type`, `route`, `produced_object_refs[]`
`task_routed`	`cognition_step (task_routed)`	Task Router	`task_type_inferred`, `route`
`context_built`	`cognition_step (context_built)`	Context Engine	`prompt_tokens_estimated`, `context_segments_count`
`advisor_planned`	`cognition_step (advisor_planned)`	Advisor Planner	`advisor_count`, `coordination_mode`
`advisor_invoked`	`advisor_invoked`	Advisor Planner	`advisor_role`, `provider_used`
`skill_running`	`cognition_step (skill_running)`	Skill Manager	`skill_id`, `provider`
`tool_called`	`tool_called`	Tool Registry	`tool_name`, `outcome`
`llm_call`	(no SSE; trace only)	LLM Client	`provider`, `model`, `input_tokens`, `output_tokens`, `cost_usd`
`evidence_audited`	`evidence_updated`	Evidence Checker	`evidence_count`, `claim_ref`
`data_quality_audited`	`data_quality_updated`	Same as above	`quality_count`
`boundary_check`	`boundary_event` (when blocked)	BoundaryGuard	`kind`, `outcome`
`sensitive_input_check`	`boundary_event (sensitive_input_rejected)`	Sensitive Input Classifier	`classification`, `action`
`object_drafting`	`cognition_step (object_drafting)`	Object Writer preparation	`object_type`
`object_write`	`snapshot_completed` / `checkpoint_completed` / `thread_refreshed`	CognitionStore	`object_type`, `object_id`
`degradation`	`degradation_notice`	AgentRuntime / Advisor Planner	`degradation_kind`, `affected_field`

6.4 Relationship between trace and SSE

A trace is the persisted causal chain (written to disk, retained for 90 days, then archived per §6.5);
SSE is the real-time event stream (not persisted; lost on disconnect);
The same event is both written to the trace (detailed attributes) and sent over SSE (simplified payload);
Users see SSE events in the frontend (coarse-grained); engineers query traces (fine-grained);
Before an SSE event is pushed, its trace span must already be closed (persistence precedes notification).

6.5 Trace Retention

Consistent with B-2 §7.1: after 90 days, traces are archived to data/traces/_archive/<YYYY-MM>/<task_id>.jsonl;
Archiving does not change IDs;
Traces of sensitive tasks (those that triggered boundary / sensitive_input) additionally link to data/sensitive/<user_id>/<input_ref>.json (B-2 §4.1), which keeps the metadata traceable without storing plaintext.

6.6 `--full-trace` debug mode

Item	Default	dev `--full-trace`
LLM prompt in trace	No	Yes (local machine only; forced off in trial-prod)
LLM response in trace	No	Yes (same as above)
Tool args in trace	Summary only	Full
Configuration	`server/config/app.yaml` `obs.full_trace: false`	Overridden at startup by env `FINCLAW_FULL_TRACE=1` (local-dev only)
Trial-prod enforcement	`obs.full_trace=false` is locked and asserted at startup	n/a

At trial-prod startup, server/scripts/obs_preflight.py checks: if FINCLAW_FULL_TRACE=1 and deployment_mode != local-dev → abort immediately (guards against enabling it by mistake).
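The guard described above amounts to a single condition. A minimal sketch follows, assuming the deployment mode is passed in as a string and that aborting is done with SystemExit; the function name check_full_trace is hypothetical and not the real obs_preflight.py entry point.

```python
import os


def check_full_trace(deployment_mode, environ=None):
    """Abort unless full tracing is off or we are in local-dev.

    Returns True when full tracing is enabled (and permitted), else False.
    """
    environ = os.environ if environ is None else environ
    enabled = environ.get("FINCLAW_FULL_TRACE") == "1"
    if enabled and deployment_mode != "local-dev":
        raise SystemExit(
            "FINCLAW_FULL_TRACE=1 is only allowed when deployment_mode is local-dev"
        )
    return enabled
```

Passing the environment as a parameter keeps the check testable without mutating os.environ.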

7. Health Endpoint Contract

7.1 Endpoint shape (aligned with B-3 §4.2.1 + down-pour)

GET /api/health         (anonymous)

Response 200 / 503 schema:

response:
  status: "ok" | "degraded" | "unavailable"
  version: string                          # git short sha
  build_time: datetime                     # ISO-8601 UTC
  kill_switch_active: bool                 # whether data/.kill_switch exists
  uptime_seconds: number
  providers:
    primary:
      id: "gpt-5.5"
      reachable: bool                      # result of probes in the last 60 seconds
      last_check_at: datetime
    secondary:
      id: "kimi-k2.6"
      reachable: bool
      last_check_at: datetime
  store:
    data_path: "/app/data"
    writable: bool                          # write a test file, then delete it
    free_bytes: number
  obs:
    logs_writable: bool
    metrics_writable: bool
    traces_writable: bool
  schema_version: "1.0"

7.2 Status semantics

status	Condition	HTTP status
`ok`	All sub-checks OK and kill_switch_active=false	200
`degraded`	kill_switch_active=false and any of: secondary provider unreachable / store free_bytes < 5 GiB (§7.4) / any obs.* writable=false	200
`unavailable`	Any of: primary provider unreachable / store writable=false / store free_bytes < 1 GiB (§7.4) / kill_switch_active=true	503

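7.3 Provider reachability probing

Item	Implementation
Probe protocol	OpenAI-compatible `/v1/models` GET (supported by both GPT-5.5 and Kimi K2.6)
Probe frequency	A background coroutine probes once every 60 seconds; results are cached in the in-memory `provider_health`
Timeout	5 seconds; a timeout counts as unreachable
Failure count	A provider is marked unreachable in `/api/health` only after 3 consecutive unreachable probes (avoids flapping on a single network blip)
Cost	1 ping per provider per minute, about 43K per provider per month (≈ 86K for the two providers); cost ≈ 0; not counted against the budget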
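The "3 consecutive failures" rule can be expressed as a small state holder. In the sketch below the class name ProviderHealth is hypothetical, and the behavior that a single successful probe immediately restores reachability is an assumption; the table only specifies when a provider is marked unreachable.

```python
class ProviderHealth:
    """Marks a provider unreachable only after N consecutive failed probes."""

    def __init__(self, failures_to_mark=3):
        self.failures_to_mark = failures_to_mark
        self.consecutive_failures = 0
        self.reachable = True

    def record_probe(self, ok):
        if ok:
            # Assumption: one success resets the counter and restores reachability.
            self.consecutive_failures = 0
            self.reachable = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failures_to_mark:
                self.reachable = False
        return self.reachable
```

At one probe every 60 seconds, this means a provider outage surfaces in `/api/health` roughly three minutes after it begins.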

7.4 Disk space check

Threshold	Behavior
free_bytes ≥ 5 GiB	ok
1 GiB ≤ free_bytes < 5 GiB	degraded + warn log
free_bytes < 1 GiB	unavailable + new mutation endpoints are rejected (503 + `kill_switch_active` is reported as if active; operators must clean up manually before restarting)
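Putting the status rules together, the overall health status can be derived from the sub-checks as in the sketch below, taking the disk thresholds from §7.4. The function name health_status and its flat argument list are assumptions for illustration.

```python
GIB = 1024 ** 3


def health_status(primary_reachable, secondary_reachable, store_writable,
                  free_bytes, obs_writable, kill_switch_active):
    """Return (status, http_code); obs_writable is an iterable of bools."""
    # Hard failures first: any of these makes the service unavailable.
    if (not primary_reachable or not store_writable or kill_switch_active
            or free_bytes < 1 * GIB):
        return "unavailable", 503
    # Soft failures: the service still answers, but something needs attention.
    if (not secondary_reachable or free_bytes < 5 * GIB
            or not all(obs_writable)):
        return "degraded", 200
    return "ok", 200
```

Evaluating the unavailable conditions before the degraded ones ensures that a combination such as "secondary down and disk nearly full" reports the more severe state.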

7.5 Liveness vs Readiness

V1 does not distinguish liveness from readiness (a K8s concept). /api/health serves both the docker-compose healthcheck and manual operator pings. If K8s is introduced later (V2), it will be split into two endpoints.

8. Error Capture and Alerting (V1 Lite)

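8.1 D-04 lock: no Sentry in V1

The D-04 stack lock rules out Sentry for V1. Error capture uses the following V1 Lite approach:

8.2 Local error.log

Item	Implementation
File	`logs/error.log` (structured JSON), written together with the current day's `logs/<YYYY-MM-DD>.jsonl`
Filter	structlog filter handler: `level >= error` → mirrored to `error.log`
Content	Same as the §4.1 log_record schema, plus an `exception:` field (with `type`, `message`, `traceback_lines[]`)
Stack trace	Kept in full, but the filter strips any `prompt` / `response` / `credential` field (§12.4)
Size limit	When a file exceeds 100 MiB, rotate to `logs/error.log.<N>`; keep the 10 most recent

8.3 Daily Digest

Every day at 02:00, run manually python -m server.scripts.error_digest --date=YYYY-MM-DD:

Output	Description
`logs/digest/<YYYY-MM-DD>.md`	Markdown report: total errors, buckets by `event` name, top 10 traceback signatures, comparison with the previous day
Terminal summary	ASCII table: component / event / count / sample message
One `digest_generated` event appended to `data/events/system/<YYYY-MM-DD>.jsonl`	For correlation by the Wave 3 commercial signal report (not stored in the CS store)

8.4 Immediate notice for high-severity events

When triggered, the following events are written immediately to data/events/system/<YYYY-MM-DD>.jsonl (independent of the CS path):

event	Trigger	Follow-up
`kill_switch_triggered`	The `data/.kill_switch` file is created	The project initiator is notified at once by a text message in the Labs collaboration group (manual mechanism)
`boundary_block`	BoundaryGuard rejects a forbidden field	Goes into the trace + error.log; the project initiator reviews it in the daily digest
`sensitive_input_rejected`	Sensitive Input Classifier rejects credential-type input	Same as above; if the count exceeds 3 within 1 hour → trigger the RB-2 procedure immediately
`provider_failure_persistent`	All providers have failed for more than 5 minutes	digest + Labs collaboration group text (same as kill_switch)
`cognition_store_write_failed`	The atomic write rename fails	Same as above

8.5 Limitations of V1 Lite

Limitation	Resolution (V2 timing)
No real-time pager / SMS	Before trial exit / public expansion, evaluate a lightweight alerting platform (e.g. OpsGenie / a simple webhook)
No error clustering / dedup	The digest script approximates dedup by hashing traceback signatures; upgrade when V2 introduces a Sentry-like service
No error-rate SLO	V1 looks only at absolute counts; SLOs are defined after trial exit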

9. SSE Event Lag Monitoring

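9.1 Lag definition

sse_event_lag_ms = sse_pushed_at - event_produced_at

event_produced_at: the moment the business-logic layer produces the SSE event (e.g. the instant a snapshot finishes writing to disk);
sse_pushed_at: the moment the SSE writer has written to the socket (after yield event\n);
The measurement point is the EventSink writer coroutine in server/api/sse.py.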
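As a worked illustration of the lag formula, the sketch below stamps an event when it is produced and computes the lag when it is pushed. It assumes both timestamps are taken in the same process with the monotonic clock, which suits V1's single-process design; the helper names and the produced_at key are hypothetical.

```python
import time


def stamp_event(event):
    """Attach the production time (monotonic seconds) to an event dict."""
    event["produced_at"] = time.monotonic()
    return event


def lag_ms(event, pushed_at=None):
    """sse_event_lag_ms = sse_pushed_at - event_produced_at, in milliseconds."""
    pushed_at = time.monotonic() if pushed_at is None else pushed_at
    return (pushed_at - event["produced_at"]) * 1000.0
```

For example, an event produced at t = 10.0 s and pushed at t = 10.25 s has a lag of 250 ms, which would be observed into the `sse_event_lag_ms` histogram under the event's name.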

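9.2 Health thresholds

Event	Expected p50	p99 alert level
`task_started`	< 50 ms	< 200 ms
`cognition_step`	< 50 ms	< 500 ms
`tool_called`	< 50 ms	< 500 ms
`advisor_invoked`	< 50 ms	< 500 ms
`snapshot_completed`	< 100 ms	< 1_000 ms
`task_completed`	< 100 ms	< 1_000 ms
heartbeat	n/a	n/a

Beyond the p99 alert level → the §5.4 alert-threshold entry sse_event_lag_ms p99 > 5000 triggers a warn.

9.3 Root-cause directions for lag

Symptom	Likely root cause
All events lag	EventSink queue blocked / asyncio loop overloaded
Only `tool_called` lags	A tool blocks the main coroutine (should be made async)
Only `snapshot_completed` lags	Slow CognitionStore IO
Events missing after a client disconnect	Not lag; check `api_sse_close_total{reason=client_disconnect}`

9.4 Alignment with the B-3 §5.4 SSE boundary

nginx proxy_buffering off: prevents buffering from introducing false lag;
60s idle heartbeat: keeps apparent lag on an idle stream from being mistaken for a disconnect.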

10. Cost Telemetry Hook

This section exposes the hook contract for v1-cost-and-token-budget-design.md (B-7); the cost model itself, provider unit costs, and budget figures are defined by B-7. This document only provides the engineering contract for "where / when / with what schema cost is recorded".

10.1 Hook trigger point

LLM Client.complete(...)
  └─ on completion, calls cost_telemetry.record(
       provider, model, skill_or_advisor_id,
       input_tokens, output_tokens, tool_tokens,
       cost_usd_estimated,
       failover_chain,
       trace_id, span_id, user_anon_id,
       request_id, task_id, route
     )

10.2 Hook write targets

Written item	Target
metrics counter `llm_call_total` / `llm_cost_usd_total`	§5.2.3
metrics histogram `llm_input_tokens` / `llm_output_tokens` / `llm_tool_tokens` / `llm_call_duration_ms`	Same as above
trace span `llm_call` attributes	§6.3
`data/eval/llm_telemetry.jsonl`, one line per call	B-1 §8.4 + B-2 §3.3
`data/eval/llm_telemetry.sqlite` aggregation (per day / per task / per provider)	Same as above
If a budget alert level is hit → SSE `degradation_notice` + log warn	§10.4

10.3 `llm_telemetry.jsonl` schema

⚠️ The flat schema in this section has been superseded by the canonical nested cost_telemetry event schema in v1-cost-and-token-budget-design.md §10.2 (B-7).

Implementation must treat B-7 §10.2 as the single source of truth (nested fields such as tokens.prompt_tokens / cost_usd.value / cost_usd.source / budget_impact.user_daily_pct_after / routing_reason).

The flat fields below are kept only as a transitional reference and are not a basis for implementation; the write location follows the §10.2 hook targets of this document (data/eval/llm_telemetry.jsonl + SQLite side index), but the field shapes follow B-7 §10.2.

# Historical flat schema (superseded → see the canonical nested schema in B-7 §10.2)
llm_telemetry_record_legacy:
  ts: datetime
  task_id: string
  trace_id: string
  span_id: string
  user_anon_id: string                     # §12.2
  route: string                            # route inferred by task_router
  skill_or_advisor_id: string              # caller
  provider: "gpt-5.5" | "kimi-k2.6" | "byom:<slug>"
  model: string                            # model name reported by the provider
  endpoint: string                          # contains no secret
  input_tokens: number                     # → B-7 §10.2: tokens.prompt_tokens
  output_tokens: number                    # → B-7 §10.2: tokens.completion_tokens
  tool_tokens: number                      # B-7 §10.2 exposes only prompt/completion/total; tool tokens are folded into prompt
  total_tokens: number                     # → B-7 §10.2: tokens.total_tokens
  cost_usd_estimated: number               # → B-7 §10.2: cost_usd.value
  cost_usd_source: ...                     # → B-7 §10.2: cost_usd.source
  duration_ms: number                      # not covered by B-7 §10.2; kept as an obs-private field
  outcome: ...                             # not covered by B-7 §10.2; kept as an obs-private field
  failover_chain: array                    # → B-7 §10.2: failover_chain
  budget_scope_hits: array                 # see §10.4; aligned with B-7 §10.2 budget_impact + triggered_tier_transition
  schema_version: "1.0"

10.4 Budget Scope Hits

At the end of every LLM call, cost telemetry evaluates whether any of the following budget scopes has been hit and records the hits in budget_scope_hits[]:

Scope	Hit definition (details in B-7)
`per_task_soft_cap`	Cumulative task cost approaches 80% of the hard cap
`per_task_hard_cap`	Task hard cap exceeded (B-7 §5)
`per_user_daily_soft_cap`	Per-user cumulative 1-day cost approaches 80% of the daily cap
`per_user_daily_hard_cap`	Exceeded
`trial_monthly_soft_cap`	Monthly cost of the whole trial approaches 80%
`trial_monthly_hard_cap`	Exceeded

When any *_hard_cap is hit:

Return a quota_exceeded error immediately (B-3 §7.2);
Push an SSE degradation_notice with degradation_kind: budget_hard_cap + affected_field: <scope>;
Write one budget_hard_cap_hit event to data/events/system/<date>.jsonl;
counter boundary_guard_reject_total{kind=budget_hard_cap} += 1.
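The scope evaluation can be sketched as a pure function over current spend and the configured hard caps. This example reads "approaches 80%" as "at or above 80% of the hard cap", which is an assumption, and the dictionary keys and function name are hypothetical; the actual cap values and definitions belong to B-7.

```python
SOFT_CAP_RATIO = 0.8  # assumption: soft cap is hit at >= 80% of the hard cap


def budget_scope_hits(spent_usd, hard_caps_usd):
    """Return the list of scope hits for one LLM call.

    Both arguments are dicts keyed by scope prefix, e.g.
    'per_task', 'per_user_daily', 'trial_monthly'.
    """
    hits = []
    for scope, cap in hard_caps_usd.items():
        spent = spent_usd.get(scope, 0.0)
        if spent > cap:
            hits.append(f"{scope}_hard_cap")
        elif spent >= SOFT_CAP_RATIO * cap:
            hits.append(f"{scope}_soft_cap")
    return hits
```

A caller would then branch on whether any returned entry ends with `_hard_cap` to run the four hard-cap steps listed above. Reporting only the hard-cap hit once a scope is exceeded, rather than both soft and hard, is a choice of this sketch.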

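10.5 Two-way hook contract with B-7

B-7 responsibility	B-5 responsibility
Defines the provider unit cost table (`config/provider_pricing.yaml`)	Reads the table to compute `cost_usd_estimated`
Defines budget caps (per_task / per_user / trial)	Evaluates them in the §10.4 hook and writes `budget_scope_hits`
Defines the overrun degradation path (shorten context / skip advisor / switch to K2.6)	On rejection in `provider_router` → log + metric + SSE notice
Defines the BYOM cost shift (not counted against the FinClaw budget)	`cost_usd_estimated=0` if provider startswith `byom:`; marks `cost_usd_source=byom_self_reported`
Defines trial monthly / total budget alarm thresholds	metrics `llm_cost_usd_total` rolling 24h / 30d aggregation → threshold alerts

Any B-7 update (budget cap changes, provider switches) only requires editing configuration (provider_pricing.yaml / budget_caps.yaml); the hooks in §10.1 to §10.4 of this document do not need to be re-implemented.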

11. Local Dev vs Trial vs Production

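V1 has only 3 deployment forms (B-1 §8.1):

11.1 Signal-level differences

Signal	local-dev	docker-compose-trial	(reserved for V2 prod)
Logs level	`debug`	`info`	`info` (adjustable)
`--full-trace` LLM prompt/response	Allowed (off by default); startup env `FINCLAW_FULL_TRACE=1`	Forced off (preflight check)	Forced off
Metrics flush interval	30 s	30 s	5 s (V2)
Health endpoint probe frequency	60 s	60 s	30 s (V2)
Error digest	Not required; on demand	Run manually by the project initiator daily at 02:00	Automatic cron (V2)
High-severity event notice	Not required	Labs collaboration group text (manual)	Automatic alerting (V2)
Traces written to disk	On by default	On by default; archived after 90 days	Same as trial

11.2 Preflight Check (before trial start)

server/scripts/obs_preflight.py verifies the following, in order, at trial-prod startup:

deployment_mode == "trial-prod" → obs.full_trace=false (assert);
data/.kill_switch does not exist;
data/ write permission (write / read / delete a test file);
logs/ write permission, same as above;
primary provider reachable (5-second probe);
config/provider_pricing.yaml and config/budget_caps.yaml load successfully and pass schema validation;
clock skew check (against the OpenAI server header): abort if the offset exceeds 60 seconds (cost telemetry depends on an accurate clock).

If any check fails → the docker container fails to start; operators must fix the problem before restarting.

11.3 Special handling for eval mode

eval (make eval mode):

Logs / metrics / traces are all enabled;
obs.full_trace is on by default (eval needs prompts / responses for reviewer verification);
Output is redirected to evaluation/runs/<timestamp>/... (isolated from trial data, B-1 §8.4);
The schema is shared with the cost telemetry hook, but SQLite writes go to data/eval/llm_telemetry.sqlite.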

12. Privacy in Telemetry

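12.1 Alignment with D-12 + B-2 §9

V1 lands the following hard constraints at the telemetry layer first (privacy / compliance review is deferred but not waived):

Constraint	Where enforced
Never write the plaintext user_id into logs / metrics / traces	§4.1 `user_anon_id` field; §12.2
Never write LLM raw prompt / response into traces (by default)	§6.6 `--full-trace` off by default + forced off in trial-prod
Never write sensitive_input_* plaintext (credentials / private keys / mnemonics)	§12.4 field blacklist + CI lint
Never write the plaintext session_id	§12.2 session_anon_id
Sensitive events enter only trace metadata (classification + masked_stub)	§12.5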

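12.2 user_id / session_id Hash

Identifier	Stored form
`user_id` (plaintext, e.g. `alice` / `bob`)	Never written to any telemetry file
`user_anon_id`	HMAC-SHA256(`user_id`, secret_salt) → 16-char hex; written to logs / metrics / traces; secret_salt lives only in server config (`server/config/app.yaml` `obs.anon_salt`) and is not committed to git
`session_id` (plaintext UUID)	Not stored directly
`session_anon_id`	Derived with the same HMAC as above

Salt rotation: the salt is not rotated during the trial; after trial exit, if a de-identified export is needed, rotate the salt and re-derive the anonymous IDs.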
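The derivation of the anonymous IDs can be sketched with the standard hmac module. The function name anon_id is hypothetical, and using the salt as the HMAC key with the plaintext ID as the message is this sketch's reading of HMAC-SHA256(`user_id`, secret_salt).

```python
import hashlib
import hmac


def anon_id(plain_id, secret_salt):
    """Derive a 16-hex-char anonymous ID; secret_salt must be bytes."""
    digest = hmac.new(secret_salt, plain_id.encode("utf-8"), hashlib.sha256)
    # Truncate the 64-char hex digest to the 16 chars stored in telemetry.
    return digest.hexdigest()[:16]
```

The same function serves both `user_anon_id` and `session_anon_id`. Because the output depends on the salt, rotating the salt yields a fresh, unlinkable set of IDs, which is what makes the post-trial export scenario above work.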

12.3 LLM prompt / response boundary

Signal	Default	Exception
metric `llm_input_tokens` / `llm_output_tokens`	Counts are stored	None
trace span `llm_call.attributes.input_tokens` etc.	Same as above	None
trace span `llm_call.attributes.prompt` / `response` (full text)	Not stored	Only with dev `--full-trace=1` and deployment_mode=local-dev
log record `data.prompt` / `data.response`	Not stored	None (forbidden in every mode)

Enforcement point: server/obs/sanitize.py provides redact_llm_payload(), and every path that writes telemetry must go through it.

12.4 Field blacklist (guarded by CI lint)

No data: {...} payload may contain any of the following keys (guarded by a ripgrep CI rule):

prompt, response, llm_input, llm_output, raw_prompt, raw_response,
credential, api_key, private_key, seed_phrase, mnemonic, wallet_address,
secret, password, token, bearer_token,
order_side, order_type, quantity, leverage, auto_execute

server/tests/test_obs_field_blacklist.py scans telemetry samples in CI and fails on any hit.
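A sample scanner of the kind that test could use is sketched below. It is a Python alternative to the ripgrep rule, written for illustration: the function name find_forbidden_keys is hypothetical, and matching keys exactly (so that `input_tokens` passes even though `token` is listed) is an assumption about the intended semantics.

```python
FORBIDDEN_KEYS = frozenset("""
prompt response llm_input llm_output raw_prompt raw_response
credential api_key private_key seed_phrase mnemonic wallet_address
secret password token bearer_token
order_side order_type quantity leverage auto_execute
""".split())


def find_forbidden_keys(payload, path="data"):
    """Recursively collect the paths of blacklisted keys in a payload."""
    hits = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            here = f"{path}.{key}"
            if key in FORBIDDEN_KEYS:
                hits.append(here)
            hits.extend(find_forbidden_keys(value, here))
    elif isinstance(payload, list):
        for i, item in enumerate(payload):
            hits.extend(find_forbidden_keys(item, f"{path}[{i}]"))
    return hits
```

A CI test would load each telemetry sample, call the scanner on its `data` field, and assert that the returned list is empty. Walking nested dicts and lists catches keys buried below the top level, which a flat text match on the first level would miss.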

12.5 Shape of sensitive input in telemetry

When the Sensitive Input Classifier rejects credential-type input:

Record	Content
trace span `sensitive_input_check`	`attributes: { classification, input_segment_index, action }` (no original text)
log record `event: sensitive_input_rejected`	Same as above + `data: { masked_stub: "abc***xyz" }` (first and last 3 characters only)
metric `boundary_guard_reject_total{kind=sensitive_input_reject}` incremented	n/a
persistence	`data/sensitive/<user_id>/<input_ref>.json` (B-2 §4.1); the trace only references `input_ref`, with no duplicate copy

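12.6 Side effects of a user withdrawing consent

Item	Action
That user's logs / metrics / traces	Retained (no PII; retention follows §6.5 + B-2 §7.1)
That user's CS events	Purged within 48h (CS-side responsibility, B-2 §6.5)
That user's `data/sensitive/<user_id>/...`	Retained permanently along with ProfileConsent (compliance evidence)
`user_anon_id` reverse-lookup table	Held by the project initiator; physically deleted together with the user's data on hard delete

Engineering telemetry contains no PII and therefore stays out of the consent-withdrawal deletion path; this is the key difference from CS.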

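12.7 Data egress boundary

Flow	Allowed?	Notes
→ synced copy under `evaluation/runs/<timestamp>/...`	Yes	eval mode only; contains anon_id, no plaintext
→ Labs governance repository `evaluation/finclaw/`	Yes	Same as above; contains de-identified samples
→ external SaaS (Sentry / Datadog, etc.)	No	Excluded by the D-04 stack lock
→ outbound public webhook	No	D-02
→ BYOM provider	No	Telemetry is never sent to any LLM provider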

13. Acceptance

This document meets the acceptance criteria of the V1 engineering task B-5:

Item	Status
All three telemetry categories (logs / metrics / traces) covered (§3)	Yes
Key metrics complete: route latency / tool_call latency / LLM token usage / cognition step count / SSE event lag / CognitionStore read-write / BoundaryGuard reject / kill_switch (§5)	Yes
Trace schema reuses the B-2 §3 Trace and shares a 1:1 vocabulary with the B-3 §5 SSE catalog (§6)	Yes
Health endpoint schema includes LLM provider reachable / store writable / disk space (§7)	Yes
Error capture V1 Lite: local `logs/error.log` + daily digest + structured JSON, without Sentry (§8)	Yes
Cost Telemetry Hook exposes a contract to B-7 (§10)	Yes
Clear boundary with v1-commercial-signal-instrumentation-design.md (§1.1)	Yes
Hard privacy constraints: never write sensitive_input_* plaintext; user ID / session ID hashed (§12)	Yes
Behavior differences between local dev / trial / production are explicit (§11)	Yes
`--full-trace` forced off in trial-prod (§6.6 + §11.2 preflight)	Yes

14. Open Items

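O-1: How obs.anon_salt is generated (random at startup vs persisted by the project initiator): to be decided during the W-10 implementation; the current leaning is "generate once at startup + write to data/_internal/obs_salt, kept by the project initiator";
O-2: Whether the daily 02:00 error digest needs cron: V1 leans toward manual triggering; re-evaluate 1 week after trial start;
O-3: How the provider /v1/models ping should behave when BYOM is enabled (a user endpoint may not implement /v1/models): to be decided at BYOM integration; the leaning is to fall back to an OPTIONS request;
O-4: Whether the metrics SQLite gets indexes within V1 (to speed up make obs-summary): index on bucket_ts + metric_name by default; other indexes added as needed;
O-5: Whether V1 exposes a /api/_admin/metrics endpoint (admin pulls the last 5 min of aggregates): not exposed by default in V1 (grepping the files is enough); evaluate during the trial;
O-6: BYOM cost is self-reported by the user; whether FinClaw should reject BYOM calls that report no cost is defined by B-7; §10.3 of this document only reserves the cost_usd_source=byom_self_reported field;
O-7: Whether the threshold of more than 3 sensitive_input rejections in 1 hour is too strict: adjust after trial start;
O-8: provider_health cache TTL (currently 60 s, marked unreachable only after 3 consecutive failures): the leaning is to tune it to network conditions after trial start.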
