Building Observability for Server-Side Performance Metrics with CloudLens for OSS
CloudLens for OSS
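Use Cases
CloudLens for OSS provides a unified, bucket-level management view with visual analysis of resource usage, access patterns, anomalies, and security. It supports scenario-based operations and management and makes bucket assets observable.
This article describes how to combine the CloudLens for OSS access detail logs in Log Service (SLS), which record all access logs and batch deletion logs of the associated OSS buckets, with SLS Scheduled SQL. The goal is to display and interactively analyze, in real time, the key server-side metrics of bucket requests: concurrent request count, response latency, and bandwidth usage.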
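Prerequisites
- Enable CloudLens for OSS access detail logs.
- Log Service (SLS) automatically creates a dedicated Project (oss-log-Alibaba Cloud account ID-region ID) and a Logstore (oss-log-store) to store the access logs of OSS buckets.
Log field details: OSS log field details_Log Service-Alibaba Cloud Help Center
- Use Scheduled SQL to aggregate summary information from the raw bucket access logs and write it to a centralized Project/Logstore.
Note: raw OSS access logs are stored in a separate Project for each region, so the corresponding Scheduled SQL job must be configured in every region's Project.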
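Because one Scheduled SQL job is needed per region, it can help to enumerate the source Project/Logstore pairs up front. The following is a minimal sketch based only on the naming pattern stated above; the account ID and region IDs are placeholders, and `oss_log_targets` is a hypothetical helper, not part of any SLS SDK.

```python
# Hypothetical helper: builds the per-region source locations from the
# naming pattern "oss-log-<account ID>-<region ID>" described in the text.
LOGSTORE = "oss-log-store"


def oss_log_targets(account_id, region_ids):
    """Return one (project, logstore) pair per region."""
    return [(f"oss-log-{account_id}-{region}", LOGSTORE) for region in region_ids]


# Placeholder values for illustration only.
targets = oss_log_targets("1234567890", ["cn-hangzhou", "cn-beijing"])
for project, logstore in targets:
    print(project, logstore)
```

Each pair printed here corresponds to one place where a Scheduled SQL job would have to be configured.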
3.1 Real-time server-side metrics
Aggregate server-side request QPS, throughput, and latency at one-minute granularity.
* | select
(__time__ - __time__%60) as logtime,
bucket as bucket_name,
owner_id as bucket_owner,
ip_to_province(client_ip) as remote_province,ip_to_provider(client_ip) as remote_provider,
vpc_addr,ip,
sync_request as request_type,
sum(total_qps) as total_qps,
sum(put_qps) as put_qps,
sum(get_qps) as get_qps,
sum(put_throughput) as put_throughput,
sum(get_throughput) as get_throughput,
sum(put_latency) as put_latency,
sum(get_latency) as get_latency,
sum(put_server_latency) as put_server_latency,
sum(get_server_latency) as get_server_latency
from
(select __time__, bucket, owner_id, client_ip, sync_request, vpc_addr,int_to_ip(cast(vpc_addr as bigint)) as ip,
case when http_method in ('GET', 'HEAD', 'PUT', 'POST') then 1 else 0 end as total_qps,
case when http_method in ('GET', 'HEAD') then 1 else 0 end as get_qps,
case when http_method in ('PUT', 'POST') then 1 else 0 end as put_qps,
case when http_method in ('GET', 'HEAD') then response_body_length else 0 end as get_throughput,
case when http_method in ('PUT', 'POST') then request_length else 0 end as put_throughput,
case when http_method in ('GET', 'HEAD') then response_time else 0 end as get_latency,
case when http_method in ('PUT', 'POST') then response_time else 0 end as put_latency,
case when http_method in ('GET', 'HEAD') then server_cost_time else 0 end as get_server_latency,
case when http_method in ('PUT', 'POST') then server_cost_time else 0 end as put_server_latency from log )
group by logtime, bucket_name, bucket_owner,
remote_province, vpc_addr, request_type,remote_provider,ip limit 1000000
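To make the aggregation logic easier to follow, here is a small Python sketch of what the query above computes. It is an illustration under simplifying assumptions, not the SLS implementation: it groups only by minute and bucket (the query also groups by owner, province, provider, and address fields), and the sample records are invented.

```python
from collections import defaultdict

READ_METHODS = {"GET", "HEAD"}
WRITE_METHODS = {"PUT", "POST"}


def aggregate_by_minute(logs):
    """Sum per-method counters for each (minute, bucket) group."""
    out = defaultdict(lambda: {
        "total_qps": 0, "get_qps": 0, "put_qps": 0,
        "get_throughput": 0, "put_throughput": 0,
    })
    for rec in logs:
        # Same flooring as (__time__ - __time__ % 60) in the query.
        minute = rec["__time__"] - rec["__time__"] % 60
        row = out[(minute, rec["bucket"])]
        method = rec["http_method"]
        if method in READ_METHODS:
            row["total_qps"] += 1
            row["get_qps"] += 1
            row["get_throughput"] += rec["response_body_length"]
        elif method in WRITE_METHODS:
            row["total_qps"] += 1
            row["put_qps"] += 1
            row["put_throughput"] += rec["request_length"]
    return dict(out)


# Invented sample records; field names follow the query above.
logs = [
    {"__time__": 1700000045, "bucket": "b1", "http_method": "GET",
     "response_body_length": 2048, "request_length": 300},
    {"__time__": 1700000070, "bucket": "b1", "http_method": "PUT",
     "response_body_length": 0, "request_length": 4096},
    {"__time__": 1700000105, "bucket": "b1", "http_method": "HEAD",
     "response_body_length": 0, "request_length": 280},
]
result = aggregate_by_minute(logs)
print(result)
```

The first two records fall into the same minute and are merged into one row; the third starts a new row.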
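3.2 Real-time server-side metrics_read
Compute P95 and P99 latency, request speed, bandwidth, and related metrics for read requests from the access logs.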
* and (http_method:GET or http_method:HEAD)
|select (__time__ - __time__%60) as logtime,
bucket as bucket_name,
owner_id as bucket_owner, ip_to_province(client_ip) as remote_province,
qps, get_throughput as throughput,
case
when get_throughput<=102400 then 'size_0_100k'
when get_throughput>102400 and get_throughput<=204800 then 'size_100k_200k'
when get_throughput>204800 and get_throughput<=1048576 then 'size_200k_1m'
when get_throughput>1048576 then 'size_1m_more' else 'other'
end as object_size_range,
'read' as request_type,
CAST(avg(response_time) as BIGINT) as rl_avg,
approx_percentile(response_time, 0.95) as rl_p95,
approx_percentile(response_time, 0.99) as rl_p99,
case when response_time<=0 then get_throughput else get_throughput/response_time end as speed
from
(select __time__, bucket, owner_id, response_time,client_ip,
case when http_method in ('GET', 'HEAD') then 1 else 0 end as qps,
case when http_method in ('GET', 'HEAD') then response_body_length else 0 end as get_throughput from log )
group by logtime, bucket_name,bucket_owner,object_size_range,qps,speed,request_type,throughput,remote_province limit 10000000
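The size buckets and the `speed` expression in the query can be sketched in Python as follows. This is an illustration only. It assumes `response_time` is reported in milliseconds, which is what makes the factor 8000 used by the appendix panels (8 bits per byte times 1000 ms per second) yield bits per second. The function names are hypothetical.

```python
def object_size_range(nbytes):
    """Mirror the CASE expression that labels a request by payload size."""
    if nbytes <= 102400:
        return "size_0_100k"
    if nbytes <= 204800:
        return "size_100k_200k"
    if nbytes <= 1048576:
        return "size_200k_1m"
    return "size_1m_more"


def speed(nbytes, response_time_ms):
    """Bytes per millisecond; like the query, fall back to the byte count
    when response_time is zero or negative to avoid dividing by zero.
    Note: this sketch uses float division, whereas a SQL engine may use
    integer division when both columns are integers."""
    return nbytes if response_time_ms <= 0 else nbytes / response_time_ms


def to_bits_per_second(bytes_per_ms):
    """Assumes milliseconds: bytes/ms * 8 bits * 1000 ms/s."""
    return bytes_per_ms * 8000


print(object_size_range(150000))                 # size_100k_200k
print(to_bits_per_second(speed(2048, 4)))        # 4096000.0
```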
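3.3 Real-time server-side metrics_write
Compute P95 and P99 latency, request speed, bandwidth, and related metrics for write requests from the access logs.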
* and (http_method:PUT or http_method:POST)
|select (__time__ - __time__%60) as logtime,
bucket as bucket_name,
owner_id as bucket_owner, ip_to_province(client_ip) as remote_province,
qps,
case
when put_throughput<=102400 then 'size_0_100k'
when put_throughput>102400 and put_throughput<=204800 then 'size_100k_200k'
when put_throughput>204800 and put_throughput<=1048576 then 'size_200k_1m'
when put_throughput>1048576 then 'size_1m_more' else 'other'
end as object_size_range,
'write' as request_type, put_throughput as throughput,
CAST(avg(response_time) as BIGINT) as rl_avg,
approx_percentile(response_time, 0.95) as rl_p95,
approx_percentile(response_time, 0.99) as rl_p99,
case when response_time<=0 then put_throughput else put_throughput/response_time end as speed
from
(select __time__, bucket, owner_id, response_time,client_ip,
case when http_method in ('PUT', 'POST') then 1 else 0 end as qps,
case when http_method in ('PUT', 'POST') then request_length else 0 end as put_throughput from log )
group by logtime, bucket_name, bucket_owner,object_size_range,qps,speed,request_type,throughput,remote_province limit 1000000
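The read and write queries both report average, P95, and P99 latency per group. The sketch below shows why the tail percentiles are tracked alongside the average. It uses an exact nearest-rank percentile as a stand-in for `approx_percentile`, which is an approximation and may return slightly different values; the sample latencies are invented.

```python
import math


def percentile_nearest_rank(values, p):
    """Exact nearest-rank percentile; p is a fraction such as 0.95."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered)))  # 1-based rank
    return ordered[rank - 1]


def latency_summary(response_times_ms):
    """Average and tail latency for one group of requests."""
    return {
        "rl_avg": sum(response_times_ms) / len(response_times_ms),
        "rl_p95": percentile_nearest_rank(response_times_ms, 0.95),
        "rl_p99": percentile_nearest_rank(response_times_ms, 0.99),
    }


# Invented response times in milliseconds: mostly fast, two slow outliers.
sample = [12, 15, 11, 240, 14, 13, 16, 12, 18, 900]
summary = latency_summary(sample)
print(summary)
```

In this sample the median is 14 ms while the tail percentiles are dominated by the slow requests, which the average alone would understate.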
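Results after Scheduled SQL aggregates the data by minute
Sample 1:
Sample 2:
Server-side latency monitoring and analysis dashboard
User request bandwidth overview
Real-time uplink and downlink bandwidth, with statistics by ISP and source province
User request speed analysis
Time series of user request speed, and the distribution of request speed by object size
User request latency analysis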
Request latency distribution by source province
Appendix
Monitoring panels: SQL samples
Panel name | SQL sample |
---|---|
Uplink P95 | (*)| select approx_percentile(get_throughput, 0.95) as get_p95, approx_percentile(put_throughput, 0.95) as put_p95 from (select (logtime-logtime%300) as logtime, sum(get_throughput)*8/300 as get_throughput , sum(put_throughput)*8/300 as put_throughput from log group by logtime) |
Downlink P95 | (*)| select approx_percentile(get_throughput, 0.95) as get_p95, approx_percentile(put_throughput, 0.95) as put_p95 from (select (logtime-logtime%300) as logtime, sum(get_throughput)*8/300 as get_throughput , sum(put_throughput)*8/300 as put_throughput from log group by logtime) |
Uplink peak | (*)| select round(max(get_throughput), 3) as get_max, round(max(put_throughput), 3) as put_max from (select (logtime-logtime%300) as logtime, sum(get_throughput)*8/300 as get_throughput , sum(put_throughput)*8/300 as put_throughput from log group by logtime) |
Downlink peak | (*)| select round(max(get_throughput), 3) as get_max, round(max(put_throughput), 3) as put_max from (select (logtime-logtime%300) as logtime, sum(get_throughput)*8/300 as get_throughput , sum(put_throughput)*8/300 as put_throughput from log group by logtime) |
Distribution by ISP (Gb/s) | (*)| select key, approx_percentile(max_throughput, 0.95) as max_p95 from (select (logtime-logtime%300) as logtime, remote_provider as key, case when sum(get_throughput)>sum(put_throughput) then sum(get_throughput)*8/300 else sum(put_throughput)*8/300 end as max_throughput from log group by logtime, key) group by key order by max_p95 desc |
Source province | (*)| select key, approx_percentile(max_throughput, 0.95) as max_p95 from (select (logtime-logtime%300) as logtime, remote_province as key, case when sum(get_throughput)>sum(put_throughput) then sum(get_throughput)*8/300 else sum(put_throughput)*8/300 end as max_throughput from log group by logtime, key) group by key order by max_p95 desc |
Bandwidth trend | (*)| select date_format(date_trunc('minute', (logtime-logtime%300)), '%Y-%m-%d %H:%i:%s') as logtime, sum(get_throughput)*8/300 as "下行带宽" , sum(put_throughput)*8/300 as "上行带宽" from log group by logtime order by logtime limit 1000000 |
Request speed | (* and request_type:read and object_size_range:size_0_100k)| select date_format(date_trunc('minute',logtime), '%Y-%m-%d %H:%i:%s') as logtime, avg(speed)*8000 as speed from log group by logtime order by logtime limit 1000000 |
Speed distribution by request size | (* and request_type:read and object_size_range:size_0_100k)| select object_size_range as "文件大小", avg(speed)*8000 as "速率" from log group by object_size_range order by "速率" desc |
Read request latency trend | (*)| select l.remote_province as remote_province, r.remote_province as province, date_format(date_trunc('minute', logtime), '%Y-%m-%d %H:%i:%s') as logtime, l.latency from (select logtime, remote_province, avg(get_latency) as latency from log group by logtime, remote_province ) as l left join (select remote_province, latency, qps, qps*100.0/total as rate from (select remote_province, avg(get_latency) as latency, sum(get_qps) as qps from log group by remote_province) as a cross join (select sum(get_qps) as total from log) as b having rate>0.1 order by latency desc limit 5) as r on l.remote_province=r.remote_province having province is not null order by logtime limit 1000000 |
Request latency P95 | (* and request_type:read and object_size_range:size_0_100k)| select date_format(date_trunc('minute', (logtime)), '%Y-%m-%d %H:%i:%s') as logtime, avg(rl_p95) as rl_p95 from log group by logtime order by logtime limit 1000000 |
Request latency P99 (top 5 requesting provinces) | (* and request_type:read and object_size_range:size_0_100k)| select date_format(date_trunc('minute', logtime), '%Y-%m-%d %H:%i:%s') as logtime, remote_province , avg(rl_p99) as latency from log where remote_province in (select remote_province from (select remote_province , sum(qps) as qps from log group by remote_province order by qps desc limit 5)) and remote_province is not null group by logtime, remote_province |
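The bandwidth panels in the table share one pattern: sum the bytes in each 5-minute window, then convert to bits per second with `*8/300`. The following Python sketch reproduces that arithmetic on invented rows shaped like the per-minute output of the Scheduled SQL jobs. `bandwidth_bps` is a hypothetical helper name.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # same 5-minute window as (logtime - logtime % 300)


def bandwidth_bps(rows, field):
    """Sum bytes per 5-minute window and convert to bits per second."""
    totals = defaultdict(int)
    for row in rows:
        window = row["logtime"] - row["logtime"] % WINDOW_SECONDS
        totals[window] += row[field]
    # bytes * 8 bits, spread over the 300 seconds of the window
    return {w: total * 8 / WINDOW_SECONDS for w, total in sorted(totals.items())}


# Invented per-minute rows; get_throughput is in bytes.
rows = [
    {"logtime": 1700000100, "get_throughput": 3_000_000},
    {"logtime": 1700000160, "get_throughput": 4_500_000},
    {"logtime": 1700000400, "get_throughput": 1_500_000},
]
series = bandwidth_bps(rows, "get_throughput")
peak = max(series.values())
print(series)
print(peak)
```

The peak panels correspond to taking the maximum of such a series, and the P95 panels to taking its 95th percentile.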