Building server-side performance metric observability with CloudLens for OSS

CloudLens for OSS

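Use cases

CloudLens for OSS provides a unified, Bucket-level management view. It offers visual analysis of resource usage, access patterns, anomaly detection, and security, and supports scenario-based operations and management, making Bucket assets observable.

This article describes how to combine the CloudLens for OSS access detail logs in Simple Log Service (SLS), which record all access logs and batch deletion logs of the associated OSS Buckets, with SLS Scheduled SQL. The goal is to display and interactively analyze, in real time, the key server-side metrics of Bucket requests, such as concurrent request count, response latency, and bandwidth usage.

image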

Prerequisites

  1. Enable access detail logs in CloudLens for OSS.

image

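  2. Simple Log Service (SLS) automatically creates a dedicated Project (oss-log-<Alibaba Cloud account ID>-<region ID>) and a Logstore (oss-log-store) to store the OSS Bucket access logs.

Log field details: OSS log field details (Simple Log Service, Alibaba Cloud Help Center)

  3. Use Scheduled SQL to aggregate summary information from the raw Bucket access logs and write it to a centralized Project/Logstore.

Note: Raw OSS access logs are stored per region in separate Projects, so a corresponding Scheduled SQL job must be configured in the Project of each region.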
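The naming rule and the per-region note above can be sketched as follows. This is only an illustration: the account ID is made up, the region IDs are examples, and the helper `oss_log_project` is a hypothetical name, not part of any SDK.

```python
def oss_log_project(account_id: str, region_id: str) -> str:
    """Build the dedicated Project name: oss-log-<account ID>-<region ID>."""
    return f"oss-log-{account_id}-{region_id}"


# Hypothetical values, for illustration only.
ACCOUNT_ID = "1234567890123456"
REGIONS = ["cn-hangzhou", "cn-shanghai"]

# Each regional source Project needs its own Scheduled SQL job,
# all writing to the same centralized Project/Logstore.
source_projects = [oss_log_project(ACCOUNT_ID, region) for region in REGIONS]
for project in source_projects:
    print(f"{project}/oss-log-store -> configure one Scheduled SQL job")
```

Listing the source Projects this way makes it easy to check that no region with active Buckets has been left without a job.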

3.1  Server-side real-time metrics

Aggregate server-side request QPS, latency, and related data at minute granularity.

    * | select
    (__time__ - __time__%60) as logtime,
    bucket as bucket_name,
    owner_id as bucket_owner,
    ip_to_province(client_ip) as remote_province,ip_to_provider(client_ip) as remote_provider,
    vpc_addr,ip,
    sync_request as request_type,
    sum(total_qps) as total_qps,
    sum(put_qps) as put_qps,
    sum(get_qps) as get_qps,
    sum(put_throughput) as put_throughput,
    sum(get_throughput) as get_throughput,
    sum(put_latency) as put_latency,
    sum(get_latency) as get_latency,
    sum(put_server_latency) as put_server_latency,
    sum(get_server_latency) as get_server_latency
    from
    (select __time__, bucket, owner_id, client_ip, sync_request, vpc_addr,int_to_ip(cast(vpc_addr as bigint)) as ip,
     case when http_method in ('GET', 'HEAD', 'PUT', 'POST') then 1 else 0 end as total_qps,
     case when http_method in ('GET', 'HEAD') then 1 else 0 end as get_qps,
     case when http_method in ('PUT', 'POST') then 1 else 0 end as put_qps,
     case when http_method in ('GET', 'HEAD') then response_body_length else 0 end as get_throughput,
     case when http_method in ('PUT', 'POST') then request_length else 0 end as put_throughput,
     case when http_method in ('GET', 'HEAD') then response_time else 0 end as get_latency,
     case when http_method in ('PUT', 'POST') then response_time else 0 end as put_latency,
     case when http_method in ('GET', 'HEAD') then server_cost_time else 0 end as get_server_latency,
     case when http_method in ('PUT', 'POST') then server_cost_time else 0 end as put_server_latency from log )
    group by logtime, bucket_name, bucket_owner,
    remote_province, vpc_addr, request_type,remote_provider,ip limit 1000000
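To make the aggregation logic easier to follow, here is a minimal Python sketch of what the query above does to each log record. It is simplified: it groups only by minute and Bucket (the query also groups by owner, province, carrier, VPC address, and request type), and the sample records and the `aggregate` helper are hypothetical. Note that with this grouping the `*_qps` columns are request counts per minute, and the `*_latency` columns are sums.

```python
from collections import defaultdict

READ_METHODS = {"GET", "HEAD"}
WRITE_METHODS = {"PUT", "POST"}


def minute_bucket(ts: int) -> int:
    """Floor a Unix timestamp to the minute, like (__time__ - __time__ % 60)."""
    return ts - ts % 60


def aggregate(records):
    """Sum per-request counters per (minute, bucket), mirroring the GROUP BY."""
    groups = defaultdict(lambda: defaultdict(int))
    for r in records:
        row = groups[(minute_bucket(r["__time__"]), r["bucket"])]
        method = r["http_method"]
        if method in READ_METHODS:
            prefix, size = "get", r["response_body_length"]
        elif method in WRITE_METHODS:
            prefix, size = "put", r["request_length"]
        else:
            continue  # other methods contribute 0 to every counter
        row["total_qps"] += 1
        row[f"{prefix}_qps"] += 1
        row[f"{prefix}_throughput"] += size
        row[f"{prefix}_latency"] += r["response_time"]
        row[f"{prefix}_server_latency"] += r["server_cost_time"]
    return {key: dict(row) for key, row in groups.items()}


BASE = 1_699_999_980  # a minute-aligned Unix timestamp (hypothetical)
records = [
    {"__time__": BASE + 5, "bucket": "demo-bucket", "http_method": "GET",
     "response_body_length": 2048, "request_length": 300,
     "response_time": 12, "server_cost_time": 5},
    {"__time__": BASE + 42, "bucket": "demo-bucket", "http_method": "PUT",
     "response_body_length": 0, "request_length": 4096,
     "response_time": 30, "server_cost_time": 20},
    {"__time__": BASE + 65, "bucket": "demo-bucket", "http_method": "GET",
     "response_body_length": 1024, "request_length": 280,
     "response_time": 8, "server_cost_time": 3},
]

result = aggregate(records)
for key in sorted(result):
    print(key, result[key])
```

The first two records fall into the same minute and are merged into one row; the third starts a new row one minute later.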

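3.2  Server-side real-time metrics_read

Compute P95 and P99 latency, transfer rate, bandwidth, and related metrics for read requests from the access logs.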

     * and (http_method:GET or http_method:HEAD)
     |select (__time__ - __time__%60) as logtime,
    bucket as bucket_name,
    owner_id as bucket_owner, ip_to_province(client_ip) as remote_province,
    qps, get_throughput as throughput,
    case
    when get_throughput<=102400 then 'size_0_100k'
    when get_throughput>102400 and get_throughput<=204800 then 'size_100k_200k'
    when get_throughput>204800 and get_throughput<=1048576 then 'size_200k_1m'
    when get_throughput>1048576 then 'size_1m_more' else 'other'
    end as object_size_range,
    'read' as request_type,
    CAST(avg(response_time) as BIGINT) as rl_avg,
    approx_percentile(response_time, 0.95) as rl_p95,
    approx_percentile(response_time, 0.99) as rl_p99,
    case when response_time<=0 then get_throughput else get_throughput/response_time end as speed
    from
    (select __time__, bucket, owner_id, response_time,client_ip,
    case when http_method in ('GET', 'HEAD') then 1 else 0 end as qps,
    case when http_method in ('GET', 'HEAD') then response_body_length else 0 end as get_throughput from log )
    group by logtime, bucket_name,bucket_owner,object_size_range,qps,speed,request_type,throughput,remote_province limit 10000000
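The two CASE expressions in the query above (size bucketing and transfer rate) can be restated as plain functions. This is a sketch under assumptions: `response_time` is taken to be in milliseconds, which the `*8000` factor in the appendix queries implies, and the division here is floating-point, whereas the SQL result may be truncated if both fields are stored as integers.

```python
def object_size_range(n_bytes: int) -> str:
    """Map a transferred size in bytes to the labels used by the query."""
    if n_bytes <= 102400:
        return "size_0_100k"
    if n_bytes <= 204800:
        return "size_100k_200k"
    if n_bytes <= 1048576:
        return "size_200k_1m"
    return "size_1m_more"


def speed(throughput: int, response_time: int) -> float:
    """Bytes per millisecond; fall back to throughput when time is <= 0."""
    if response_time <= 0:
        return throughput
    return throughput / response_time


def to_bits_per_second(bytes_per_ms: float) -> float:
    """bytes/ms * 1000 ms/s * 8 bits/byte, the speed*8000 used by the panels."""
    return bytes_per_ms * 8000


print(object_size_range(150 * 1024))        # size_100k_200k
print(to_bits_per_second(speed(1000, 10)))  # 800000.0
```

The same thresholds apply to the write query, with `request_length` in place of `response_body_length`.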

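3.3  Server-side real-time metrics_write

Compute P95 and P99 latency, transfer rate, bandwidth, and related metrics for write requests from the access logs.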

     * and (http_method:PUT or http_method:POST)
     |select (__time__ - __time__%60) as logtime,
    bucket as bucket_name,
    owner_id as bucket_owner, ip_to_province(client_ip) as remote_province,
    qps,
    case
    when put_throughput<=102400 then 'size_0_100k'
    when put_throughput>102400 and put_throughput<=204800 then 'size_100k_200k'
    when put_throughput>204800 and put_throughput<=1048576 then 'size_200k_1m'
    when put_throughput>1048576 then 'size_1m_more' else 'other'
    end as object_size_range,
    'write' as request_type, put_throughput as throughput,
    CAST(avg(response_time) as BIGINT) as rl_avg,
    approx_percentile(response_time, 0.95) as rl_p95,
    approx_percentile(response_time, 0.99) as rl_p99,
    case when response_time<=0 then put_throughput else put_throughput/response_time end as speed
    from
    (select __time__, bucket, owner_id, response_time,client_ip,
    case when http_method in ('PUT', 'POST') then 1 else 0 end as qps,
    case when http_method in ('PUT', 'POST') then request_length else 0 end as put_throughput from log )
    group by logtime, bucket_name, bucket_owner,object_size_range,qps,speed,request_type,throughput,remote_province limit 1000000
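Both the read and the write query report `rl_p95` and `rl_p99`. As a rough illustration of why these tail metrics are tracked alongside the average, the sketch below computes exact nearest-rank percentiles over a made-up latency sample. This is a stand-in only: `approx_percentile` returns an approximation and its internal algorithm is not shown here.

```python
import math


def percentile_nearest_rank(values, pct: int):
    """Exact nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds: mostly fast, two slow outliers.
latencies_ms = [10] * 18 + [50, 200]

mean = sum(latencies_ms) / len(latencies_ms)
p95 = percentile_nearest_rank(latencies_ms, 95)
p99 = percentile_nearest_rank(latencies_ms, 99)
print(f"avg={mean} p95={p95} p99={p99}")
```

In this sample the average stays close to the typical request, while P95 and P99 expose the slow requests that the average hides.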

Results after Scheduled SQL aggregates the data by minute

Sample 1:

image

Sample 2:

image

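Server-side latency monitoring and analysis dashboard

User request bandwidth overview

Real-time upstream and downstream bandwidth, with statistics by carrier and by source province.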

image

User request transfer rate analysis

Time series of the user request transfer rate, and its distribution by object size.

image

User request latency analysis

Request latency distribution by source province.

image

Appendix

Monitoring panel SQL samples

Each entry below gives the panel name followed by its SQL sample.

Upstream P95

    (*)| select approx_percentile(get_throughput, 0.95) as get_p95, approx_percentile(put_throughput, 0.95) as put_p95 from (select (logtime-logtime%300) as logtime, sum(get_throughput)*8/300 as get_throughput , sum(put_throughput)*8/300 as put_throughput from log group by logtime)

Downstream P95

    (*)| select approx_percentile(get_throughput, 0.95) as get_p95, approx_percentile(put_throughput, 0.95) as put_p95 from (select (logtime-logtime%300) as logtime, sum(get_throughput)*8/300 as get_throughput , sum(put_throughput)*8/300 as put_throughput from log group by logtime)

Upstream peak

    (*)| select round(max(get_throughput), 3) as get_max, round(max(put_throughput), 3) as put_max from (select (logtime-logtime%300) as logtime, sum(get_throughput)*8/300 as get_throughput , sum(put_throughput)*8/300 as put_throughput from log group by logtime)

Downstream peak

    (*)| select round(max(get_throughput), 3) as get_max, round(max(put_throughput), 3) as put_max from (select (logtime-logtime%300) as logtime, sum(get_throughput)*8/300 as get_throughput , sum(put_throughput)*8/300 as put_throughput from log group by logtime)

Distribution by carrier (Gb/s)

    (*)| select key, approx_percentile(max_throughput, 0.95) as max_p95 from (select (logtime-logtime%300) as logtime, remote_provider as key, case when sum(get_throughput)>sum(put_throughput) then sum(get_throughput)*8/300 else sum(put_throughput)*8/300 end as max_throughput from log group by logtime, key) group by key order by max_p95 desc

Source province

    (*)| select key, approx_percentile(max_throughput, 0.95) as max_p95 from (select (logtime-logtime%300) as logtime, remote_province as key, case when sum(get_throughput)>sum(put_throughput) then sum(get_throughput)*8/300 else sum(put_throughput)*8/300 end as max_throughput from log group by logtime, key) group by key order by max_p95 desc

Bandwidth trend

    (*)| select date_format(date_trunc('minute', (logtime-logtime%300)), '%Y-%m-%d %H:%i:%s') as logtime, sum(get_throughput)*8/300 as "下行带宽" , sum(put_throughput)*8/300 as "上行带宽" from log group by logtime order by logtime limit 1000000

Request transfer rate

    (* and request_type:read and object_size_range:size_0_100k)| select date_format(date_trunc('minute',logtime), '%Y-%m-%d %H:%i:%s') as logtime, avg(speed)*8000 as speed from log group by logtime order by logtime limit 1000000

Transfer rate distribution by request size

    (* and request_type:read and object_size_range:size_0_100k)| select object_size_range as "文件大小", avg(speed)*8000 as "速率" from log group by object_size_range order by "速率" desc

Read request latency trend

    (*)| select l.remote_province as remote_province, r.remote_province as province, date_format(date_trunc('minute', logtime), '%Y-%m-%d %H:%i:%s') as logtime, l.latency from (select logtime, remote_province, avg(get_latency) as latency from log group by logtime, remote_province ) as l left join (select remote_province, latency, qps, qps*100.0/total as rate from (select remote_province, avg(get_latency) as latency, sum(get_qps) as qps from log group by remote_province) as a cross join (select sum(get_qps) as total from log) as b having rate>0.1 order by latency desc limit 5) as r on l.remote_province=r.remote_province having province is not null order by logtime limit 1000000

Request latency P95

    (* and request_type:read and object_size_range:size_0_100k)| select date_format(date_trunc('minute', (logtime)), '%Y-%m-%d %H:%i:%s') as logtime, avg(rl_p95) as rl_p95 from log group by logtime order by logtime limit 1000000

Request latency P99 (top 5 requesting provinces)

    (* and request_type:read and object_size_range:size_0_100k)| select date_format(date_trunc('minute', logtime), '%Y-%m-%d %H:%i:%s') as logtime, remote_province , avg(rl_p99) as latency from log where remote_province in (select remote_province from (select remote_province , sum(qps) as qps from log group by remote_province order by qps desc limit 5)) and remote_province is not null group by logtime, remote_province
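Several of the bandwidth panels share the expression `sum(...)*8/300` over 5-minute windows. The Python sketch below walks through that conversion on made-up per-minute rows: bytes are summed per 300-second window, multiplied by 8 to get bits, and divided by 300 to get an average bits-per-second figure. The row values and the `bandwidth_bps` helper are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 300


def bandwidth_bps(rows, field):
    """Average bits per second per 5-minute window: sum(bytes) * 8 / 300."""
    totals = defaultdict(int)
    for row in rows:
        window = row["logtime"] - row["logtime"] % WINDOW_SECONDS
        totals[window] += row[field]
    return {w: total * 8 / WINDOW_SECONDS for w, total in sorted(totals.items())}


W0 = 1_700_000_100  # a 300-second-aligned Unix timestamp (hypothetical)
rows = [
    {"logtime": W0, "get_throughput": 3_000_000},
    {"logtime": W0 + 60, "get_throughput": 4_500_000},
    {"logtime": W0 + 300, "get_throughput": 1_500_000},
]

series = bandwidth_bps(rows, "get_throughput")
peak = max(series.values())
print(series)
print("peak bits/s:", peak)
```

The peak panels take the maximum of this per-window series, and the P95 panels apply a percentile to the same series.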