
Best Practices for Monitoring Project Resource Quotas with CloudLens for SLS

CloudLens for SLS

Try the Demo

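Background

Alibaba Cloud Lens is built on SLS and provides unified observability for cloud products. It supports one-click collection of instance logs (important logs, detailed logs, and job run logs) and global logs (audit logs, billing logs, error logs, and monitoring metrics).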

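| Log category | Subcategory | Monitoring scenarios |
|---|---|---|
| Instance logs | Detailed logs (paid) | Access traffic monitoring; access exception monitoring |
| Instance logs | Important logs (free) | Consumer group monitoring; Logtail collection monitoring |
| Instance logs | Job run logs (free) | Data transformation (new version) monitoring; scheduled SQL task monitoring |
| Global logs | Audit logs (free) | Resource operation monitoring |
| Global logs | Error logs (free) | Quota overrun monitoring; access exception monitoring; operation exception monitoring |
| Global logs | Monitoring metrics (free) | Access traffic monitoring; access exception monitoring; resource quota usage-level monitoring |
| Global logs | Billing logs (free) | Resource usage tracking |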

For a description of each log type, see the CloudLens log index table: https://help.aliyun.com/document_detail/456901.html?spm=a2c4g.456864.0.0.e979723c8We7zA

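Use Cases

This article describes how to use the global error logs and monitoring metrics in CloudLens for SLS to monitor the usage level of Project resource quotas, to detect quota overruns, and to submit a request to raise a resource quota.

Prerequisites

  1. Enable CloudLens for SLS, together with global error logs and monitoring metrics.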

image

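  2. Store the global monitoring logs in the same Project.

To build real-time monitoring of resource quota usage levels, the global monitoring logs (error logs and monitoring metrics) must be stored in the same Project. To keep the monitoring logs from consuming the quota of a business Project, choose a dedicated target Project in a fixed region, for example in the Hangzhou region: log-service-{user ID}-cn-hangzhou.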

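CloudLens for SLS Quota Monitoring Dashboard

Resource Quota Warning Overview

The report provides an overview of resource quota warnings (usage level above 80%) and the distribution of quota overruns.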

image

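Real-Time Usage Levels of Key Project Resource Quotas

Shows the real-time usage levels of some of a Project's basic resource quotas and of its data read/write resource quotas.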

image

image

image

Project Resource Quota Overrun Details

image

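Monitoring Practices

  1. Categories of quota monitoring items:

| Category | Monitoring item | Description |
|---|---|---|
| Real-time usage-level monitoring | Basic resource quota usage-level monitoring | Monitors whether the usage level of the LogStore count, machine group count, or Logtail collection configuration count in a Project exceeds the expected threshold percentage. Depends on Metricstore: internal-monitor-metric |
| Real-time usage-level monitoring | Data read/write quota usage-level monitoring | Monitors whether the usage level of Project write traffic or Project write count exceeds the expected threshold percentage. Depends on Metricstore: internal-monitor-metric |
| Quota overrun monitoring | Resource quota overrun count monitoring | Monitors how many times the basic quotas and data read/write quotas are exceeded. Depends on Logstore: internal-error_log |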
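  2. Breakdown of the advanced monitoring items:

| Category | Scenario | Monitoring item | Description |
|---|---|---|---|
| Basic resource quota | LogStore | Real-time usage-level monitoring | Monitors whether the usage level of the LogStore count in a Project exceeds the expected threshold percentage. Depends on Metricstore: internal-monitor-metric |
| Basic resource quota | LogStore | Quota overrun monitoring | Monitors how many times the LogStore count quota of a Project is exceeded. Depends on Logstore: internal-error_log |
| Basic resource quota | Machine group | Usage-level monitoring | Monitors whether the usage level of the machine group count in a Project exceeds the expected threshold percentage. Depends on Metricstore: internal-monitor-metric |
| Basic resource quota | Machine group | Quota overrun monitoring | Monitors how many times the machine group count quota of a Project is exceeded. Depends on Logstore: internal-error_log |
| Basic resource quota | Logtail collection configuration | Usage-level monitoring | Monitors whether the usage level of the Logtail collection configuration count in a Project exceeds the expected threshold percentage. Depends on Metricstore: internal-monitor-metric |
| Basic resource quota | Logtail collection configuration | Quota overrun monitoring | Monitors how many times the Logtail collection configuration count quota of a Project is exceeded. Depends on Logstore: internal-error_log |
| Data read/write resource quota | Project write traffic | Usage-level monitoring | Monitors whether the usage level of Project write traffic exceeds the expected threshold percentage. Depends on Metricstore: internal-monitor-metric |
| Data read/write resource quota | Project write traffic | Quota overrun monitoring | Monitors how many times the Project write traffic quota is exceeded. Depends on Logstore: internal-error_log |
| Data read/write resource quota | Project write count | Usage-level monitoring | Monitors whether the usage level of the Project write count exceeds the expected threshold percentage. Depends on Metricstore: internal-monitor-metric |
| Data read/write resource quota | Project write count | Quota overrun monitoring | Monitors how many times the Project write count quota is exceeded. Depends on Logstore: internal-error_log |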

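Basic Monitoring

Basic Resource Quota Usage-Level Monitoring

1. Confirm the alert SQL: every 15 minutes, check whether the usage level of the LogStore count, machine group count, or Logtail collection configuration count has reached the alert threshold.

image

Note: a query returns 100 rows by default. To return more results, append a limit clause to the end of the SQL; for example, limit 1000 returns up to 1,000 rows.

The query SQL is shown below. (Note: an alert can check at most 1,000 rows of the result against the trigger condition, so it is best to filter by usage level inside the alert SQL first, as is done here with logstore_ratio > 80 or machine_group_ratio > 80 or logtail_config_ratio > 80.)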

    * | select Project, region, logstore_ratio, machine_group_ratio, logtail_config_ratio from
    (SELECT A.id as Project , A.region as region,
    round(COALESCE(SUM(B.count_logstore), 0)/cast(json_extract(A.quota, '$.logstore') as double) * 100, 3)  as logstore_ratio,  cast(json_extract(A.quota, '$.logstore') as double) as quota_logstore,
    round(COALESCE(SUM(C.count_machine_group), 0)/cast(json_extract(A.quota, '$.machine_group') as double) * 100, 3)  as machine_group_ratio, cast(json_extract(A.quota, '$.machine_group') as double) as quota_machine_group,
    round(COALESCE(SUM(D.count_logtail_config), 0)/cast(json_extract(A.quota, '$.config') as double) * 100, 3)  as logtail_config_ratio, cast(json_extract(A.quota, '$.config') as double) as quota_logtail_config
    FROM  "resource.sls.cmdb.project" as A
    LEFT JOIN (
      SELECT project, COUNT(*) AS count_logstore
      FROM "resource.sls.cmdb.logstore" as B
      GROUP BY project
    ) AS B ON A.id = B.project
    LEFT JOIN (
      SELECT project, COUNT(*) AS count_machine_group
      FROM "resource.sls.cmdb.machine_group" as C
      GROUP BY project
    ) AS C ON A.id = C.project
    LEFT JOIN (
      SELECT project, COUNT(*) AS count_logtail_config
      FROM "resource.sls.cmdb.logtail_config" as D
      GROUP BY project
    ) AS D ON A.id = D.project
    group by  A.id, A.quota, A.region)
    where quota_logstore is not null and quota_machine_group is not null and quota_logtail_config is not null  and (logstore_ratio > 80 or machine_group_ratio > 80 or logtail_config_ratio > 80) limit 10000
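The arithmetic behind the query can be illustrated outside SLS. The following is a minimal Python sketch, not part of CloudLens: the function names and the sample Projects are hypothetical, and treating a zero quota as "unknown" is an assumption of the sketch. It mirrors the ratio formula (used / quota * 100, rounded to 3 decimals), the "quota is not null" filter, and the "any ratio above 80" filter.

```python
from typing import Optional

def usage_percent(used: float, quota: Optional[float]) -> Optional[float]:
    """Usage as a percentage of quota, rounded to 3 decimals as in the SQL."""
    if quota is None or quota == 0:
        # Assumption of this sketch: a missing or zero quota means "unknown".
        return None
    return round(used / quota * 100, 3)

def projects_over_threshold(projects, threshold=80.0):
    """Keep Projects whose quotas are all known and where any ratio exceeds threshold."""
    hits = []
    for p in projects:
        quota = p["quota"]
        ratios = {
            "logstore_ratio": usage_percent(p["logstores"], quota.get("logstore")),
            "machine_group_ratio": usage_percent(p["machine_groups"], quota.get("machine_group")),
            "logtail_config_ratio": usage_percent(p["logtail_configs"], quota.get("config")),
        }
        if any(r is None for r in ratios.values()):
            continue  # same effect as the "is not null" conditions in the SQL
        if any(r > threshold for r in ratios.values()):
            hits.append({"Project": p["id"], **ratios})
    return hits

# Hypothetical sample data, for illustration only.
sample_projects = [
    {"id": "demo-a", "logstores": 170, "machine_groups": 10, "logtail_configs": 20,
     "quota": {"logstore": 200, "machine_group": 200, "config": 200}},
    {"id": "demo-b", "logstores": 5, "machine_groups": 1, "logtail_configs": 2,
     "quota": {"logstore": 200, "machine_group": 200, "config": 200}},
]

over = projects_over_threshold(sample_projects)
```

Filtering before the rows reach the alert keeps the result well under the 1,000-row comparison limit mentioned above.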

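2. Configure the alert

Configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when any one of a Project's LogStore count, machine group count, or Logtail collection configuration count exceeds 90% of its quota.
  • Severity is Medium when any one of a Project's LogStore count, machine group count, or Logtail collection configuration count exceeds 80% of its quota.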

image

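Data Read/Write Quota Usage-Level Monitoring

1. Confirm the alert SQL: every minute, check whether the usage level of Project write traffic or Project write count has reached the alert threshold.

image

Note: a query returns 100 rows by default. To return more results, append a limit clause to the end of the SQL; for example, limit 1000 returns up to 1,000 rows.

Query SQL: (Note: an alert can check at most 1,000 rows of the result against the trigger condition, so it is best to filter by write traffic and write count inside the alert SQL first, as is done here with where inflow_ratio > 80 or write_cnt_ratio > 80.)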

    (*)| select Project, region, inflow_ratio, write_cnt_ratio from (SELECT cmdb.id as Project, cmdb.region as region, round(COALESCE(M.name1,0)/round(cast(json_extract(cmdb.quota, '$.inflow_per_min') as double)/1000000000, 3) * 100, 3) as inflow_ratio, round(COALESCE(M.name2,0)/cast(json_extract(cmdb.quota, '$.write_cnt_per_min') as double) * 100, 3) as write_cnt_ratio
     from "resource.sls.cmdb.project" as cmdb
    LEFT JOIN (
         SELECT element_at(__labels__, 'project') as project,
     round(MAX(CASE WHEN __name__ = 'logstore_origin_inflow_bytes' THEN __value__ ELSE NULL END)/1000000000, 3) AS name1,
     MAX(CASE WHEN __name__ = 'logstore_write_count' THEN __value__ ELSE NULL END) AS name2
      FROM "internal-monitor-metric.prom" where __name__ in ('logstore_origin_inflow_bytes','logstore_write_count') and regexp_like(element_at( split_to_map(__labels__, '|', '#$#') , 'project') , '.*') group by project) AS M ON cmdb.id = M.project) where inflow_ratio > 80 or write_cnt_ratio > 80  limit 10000

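2. Configure the alert

Set the query time range to the last 5 minutes (relative), then configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when either the write traffic or the write count of a Project exceeds 90% of its quota.
  • Severity is Medium when either the write traffic or the write count of a Project exceeds 80% of its quota.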
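The unit handling in the write-traffic ratio is easy to misread, so here is a small Python sketch of the same calculation. It assumes, as the query's division by 1000000000 suggests, that both the metric value and the inflow_per_min quota are expressed in bytes; the function names and sample numbers are hypothetical.

```python
from typing import Optional

BYTES_PER_GB = 1_000_000_000  # the query divides by 1000000000 (decimal GB)

def inflow_ratio(inflow_bytes: float, quota_bytes_per_min: float) -> Optional[float]:
    """Write traffic as a percentage of quota, following the query's rounding steps."""
    used_gb = round(inflow_bytes / BYTES_PER_GB, 3)
    quota_gb = round(quota_bytes_per_min / BYTES_PER_GB, 3)
    if quota_gb == 0:
        return None  # assumption of this sketch: skip Projects with no usable quota
    return round(used_gb / quota_gb * 100, 3)

def write_cnt_ratio(write_count: float, quota_per_min: float) -> Optional[float]:
    """Write count as a percentage of quota; no unit conversion is needed."""
    if quota_per_min == 0:
        return None
    return round(write_count / quota_per_min * 100, 3)

# Hypothetical values: 45 GB written against a 50 GB per-minute quota.
example_inflow = inflow_ratio(45_000_000_000, 50_000_000_000)
example_writes = write_cnt_ratio(3_000, 600_000)
```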

image

image

Resource Quota Overrun Count Monitoring

1. Confirm the alert SQL: every 15 minutes, check whether any quota overrun has occurred.

image

Query SQL:

    ((* and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode:WriteQuotaExceed or ErrorCode: ShardWriteQuotaExceed or ErrorCode: ShardReadQuotaExceed)))| SELECT Project,
    CASE
    WHEN ErrorMsg like '%Project write quota exceed: inflow%' then 'Project write traffic quota exceeded'
    WHEN ErrorMsg like '%Project write quota exceed: qps%' then 'Project write count quota exceeded'
    WHEN ErrorMsg like '%dashboard quota exceed%' then 'Dashboard quota exceeded'
    WHEN ErrorMsg like '%config count%' then 'Logtail collection configuration quota exceeded'
    WHEN ErrorMsg like '%machine group count%' then 'Machine group quota exceeded'
    WHEN ErrorMsg like '%Alert count %' then 'Alert quota exceeded'
    WHEN ErrorMsg like '%logstore count %' then 'LogStore count quota exceeded'
    WHEN ErrorMsg like '%shard count%' then 'Shard count quota exceeded'
    WHEN ErrorMsg like '%shard write bytes%' then 'Shard write quota exceeded'
    WHEN ErrorMsg like '%shard write quota%' then 'Shard write quota exceeded'
    WHEN ErrorMsg like '%user can only run%' then 'SQL analysis concurrency quota exceeded'
        ELSE ErrorMsg
      END AS ErrorMsg,
    COUNT(1) AS count GROUP BY Project, ErrorMsg Limit 1000

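2. Configure the alert

Configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when any quota overrun error occurs 10 times.
  • Severity is Medium when any quota overrun error occurs 1 time.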
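The classification and counting done by the overrun query, followed by the two severity rules above, can be sketched in Python as follows. This is an illustration only: the substring patterns are copied from the query's LIKE clauses (a subset is shown), while the labels, function names, and sample records are hypothetical.

```python
from collections import Counter

# (substring, label) pairs, checked in order like the CASE expression.
PATTERNS = [
    ("Project write quota exceed: inflow", "Project write traffic quota exceeded"),
    ("Project write quota exceed: qps", "Project write count quota exceeded"),
    ("config count", "Logtail collection configuration quota exceeded"),
    ("machine group count", "Machine group quota exceeded"),
    ("logstore count ", "LogStore count quota exceeded"),
]

def classify(error_msg: str) -> str:
    """Return the label of the first matching pattern, else the raw message."""
    for needle, label in PATTERNS:
        if needle in error_msg:  # case-sensitive, like SQL LIKE '%needle%'
            return label
    return error_msg

def count_overruns(records):
    """Count errors per (Project, label), like GROUP BY Project, ErrorMsg."""
    return Counter((r["Project"], classify(r["ErrorMsg"])) for r in records)

def severity(count: int):
    """Check the stricter rule first: 10 or more is critical, 1 or more is medium."""
    if count >= 10:
        return "critical"
    if count >= 1:
        return "medium"
    return None

# Hypothetical error records, for illustration only.
records = (
    [{"Project": "demo-a", "ErrorMsg": "Project write quota exceed: qps: 1000"}] * 12
    + [{"Project": "demo-b", "ErrorMsg": "machine group count 201 exceed quota"}]
)
counts = count_overruns(records)
levels = {key: severity(n) for key, n in counts.items()}
```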

image

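Advanced Monitoring

The items below break the basic monitoring down further. They are usually not needed, but you can refer to them if you want finer-grained alerting.

LogStore Monitoring

Usage-Level Monitoring

1. Confirm the alert SQL: every 15 minutes, check whether the usage level of the LogStore count has reached the alert threshold.

Note: a query returns 100 rows by default. To return more results, append a limit clause to the end of the SQL; for example, limit 1000 returns up to 1,000 rows.

image

Query SQL: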

    * |  select Project, region, round(count_logstore/quota_logstore * 100, 3)  as logstore_ratio from
    (SELECT A.id as Project , A.region as region, COALESCE(SUM(B.count_logstore), 0)  AS count_logstore , cast(json_extract(A.quota, '$.logstore') as double)  as quota_logstore
    FROM  "resource.sls.cmdb.project" as A
    LEFT JOIN (
      SELECT project, COUNT(*) AS count_logstore
      FROM "resource.sls.cmdb.logstore" as B
      GROUP BY project
    ) AS B ON A.id = B.project
    group by A.id, A.quota, A.region) where  quota_logstore is not null   order by logstore_ratio desc limit 1000

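2. Configure the alert

Configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when the LogStore count of a Project exceeds 90% of its quota.
  • Severity is Medium when the LogStore count of a Project exceeds 80% of its quota.

Note that when several trigger conditions are configured, they are evaluated from top to bottom, so logstore_ratio>90 must be placed above logstore_ratio>80.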
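Why the order matters can be seen with a small first-match sketch in Python. It only models the top-to-bottom evaluation described above; the function and the condition lists are hypothetical and are not the SLS alert engine.

```python
def evaluate(ratio, conditions):
    """Return the severity of the first condition whose threshold is exceeded."""
    for threshold, severity in conditions:  # evaluated top to bottom
        if ratio > threshold:
            return severity
    return None

# Stricter condition first: a ratio of 95 is reported as critical.
correct_order = [(90, "critical"), (80, "medium")]

# Looser condition first: it matches everything above 80, so the
# critical condition is never reached.
wrong_order = [(80, "medium"), (90, "critical")]

result_correct = evaluate(95, correct_order)
result_wrong = evaluate(95, wrong_order)
```

With the looser condition on top, a Project at 95% would only ever raise a Medium alert.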

image

Overrun Monitoring

1. Confirm the alert SQL: every 15 minutes, check whether the LogStore count quota has been exceeded.

image

Query SQL:

    * and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode:WriteQuotaExceed)| SELECT Project,
    COUNT(1) AS count where ErrorMsg like '%logstore count %' GROUP BY Project ORDER BY count DESC LIMIT 1000

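2. Configure the alert

Configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when LogStore quota overrun errors occur 10 times in a Project.
  • Severity is Medium when a LogStore quota overrun error occurs 1 time in a Project.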

image

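Machine Group Monitoring

Usage-Level Monitoring

1. Confirm the alert SQL: every 15 minutes, check whether the usage level of the machine group count has reached the alert threshold.

Note: a query returns 100 rows by default. To return more results, append a limit clause to the end of the SQL; for example, limit 1000 returns up to 1,000 rows.

image

Query SQL: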

    * |  select Project, region, round(count_machine_group/quota_machine_group * 100, 3)  as machine_group_ratio from
    (SELECT A.id as Project , A.region as region, COALESCE(SUM(B.count_machine_group), 0)  AS count_machine_group , cast(json_extract(A.quota, '$.machine_group') as double)  as quota_machine_group
    FROM  "resource.sls.cmdb.project" as A
    LEFT JOIN (
      SELECT project, COUNT(*) AS count_machine_group
      FROM "resource.sls.cmdb.machine_group" as B
      GROUP BY project
    ) AS B ON A.id = B.project
    group by A.id, A.quota, A.region) where  quota_machine_group is not null   order by machine_group_ratio desc limit 1000

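2. Configure the alert

Configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when the machine group count of a Project exceeds 90% of its quota.
  • Severity is Medium when the machine group count of a Project exceeds 80% of its quota.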

image

Overrun Monitoring

1. Confirm the alert SQL: every 15 minutes, check whether the machine group quota has been exceeded.

image

Query SQL:

    * and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode:WriteQuotaExceed)| SELECT Project,
    COUNT(1) AS count where ErrorMsg like '%machine group count%' GROUP BY Project ORDER BY count DESC LIMIT 1000

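2. Configure the alert

Configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when machine group quota overrun errors occur 10 times in a Project.
  • Severity is Medium when a machine group quota overrun error occurs 1 time in a Project.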

image

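Logtail Collection Configuration Monitoring

Usage-Level Monitoring

1. Confirm the alert SQL: every 15 minutes, check whether the usage level of the Logtail collection configuration count has reached the alert threshold.

Note: a query returns 100 rows by default. To return more results, append a limit clause to the end of the SQL; for example, limit 1000 returns up to 1,000 rows.

image

Query SQL: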

    * |  select Project, region, round(count_logtail_config/quota_logtail_config * 100, 3)  as logtail_config_ratio from
    (SELECT A.id as Project , A.region as region, COALESCE(SUM(B.count_logtail_config), 0)  AS count_logtail_config , cast(json_extract(A.quota, '$.config') as double)  as quota_logtail_config
    FROM  "resource.sls.cmdb.project" as A
    LEFT JOIN (
      SELECT project, COUNT(*) AS count_logtail_config
      FROM "resource.sls.cmdb.logtail_config" as B
      GROUP BY project
    ) AS B ON A.id = B.project
    group by A.id, A.quota, A.region) where  quota_logtail_config is not null   order by logtail_config_ratio desc limit 1000

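2. Configure the alert

Configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when the Logtail collection configuration count of a Project exceeds 90% of its quota.
  • Severity is Medium when the Logtail collection configuration count of a Project exceeds 80% of its quota.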

image

Overrun Monitoring

1. Confirm the alert SQL: every 15 minutes, check whether the Logtail collection configuration quota has been exceeded.

image

Query SQL:

    * and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode:WriteQuotaExceed)| SELECT Project,
    COUNT(1) AS count where ErrorMsg like '%config count%' GROUP BY Project ORDER BY count DESC LIMIT 1000

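2. Configure the alert

Configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when Logtail collection configuration quota overrun errors occur 10 times in a Project.
  • Severity is Medium when a Logtail collection configuration quota overrun error occurs 1 time in a Project.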

image

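Project Write Traffic Monitoring

Usage-Level Monitoring

1. Confirm the alert SQL: every minute, check whether the usage level of Project write traffic over the last 5 minutes (relative) has reached the alert threshold.

image

Query SQL: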

    (*)| SELECT Project, region , round(count_inflow/cast(quota_inflow as double) * 100, 3) as inflow_ratio
    FROM
    (SELECT cmdb.id as Project, cmdb.region as region, COALESCE(M.name1,0) as count_inflow, round(cast(json_extract(cmdb.quota, '$.inflow_per_min') as double)/1000000000, 3) as quota_inflow  from "resource.sls.cmdb.project" as cmdb
    LEFT JOIN (
         SELECT element_at(__labels__, 'project') as project,
     round(MAX(CASE WHEN __name__ = 'logstore_origin_inflow_bytes' THEN __value__ ELSE NULL END)/1000000000, 3) AS name1
      FROM "internal-monitor-metric.prom" where __name__ ='logstore_origin_inflow_bytes' and regexp_like(element_at( split_to_map(__labels__, '|', '#$#') , 'project') , '.*') group by project) AS M ON cmdb.id = M.project )order by inflow_ratio desc  limit 1000

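2. Configure the alert

Configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when the write traffic of a Project exceeds 90% of its quota.
  • Severity is Medium when the write traffic of a Project exceeds 80% of its quota.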

image

image

Overrun Monitoring

1. Confirm the alert SQL: every 15 minutes, check whether the Project write traffic quota has been exceeded.

image

Query SQL:

    * and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode:WriteQuotaExceed)| SELECT Project,
    COUNT(1) AS count where ErrorMsg like '%Project write quota exceed: inflow%' GROUP BY Project ORDER BY count DESC LIMIT 1000

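2. Configure the alert

Configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when Project write traffic quota overrun errors occur 10 times.
  • Severity is Medium when a Project write traffic quota overrun error occurs 1 time.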

image

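Project Write Count Monitoring

Usage-Level Monitoring

1. Confirm the alert SQL: every minute, check whether the usage level of the Project write count over the last 5 minutes (relative) has reached the alert threshold.

image

Query SQL: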

    (*)| SELECT Project, region,  round(count_write_cnt/cast(quota_write_cnt as double) * 100, 3) as write_cnt_ratio
    FROM
    (SELECT cmdb.id as Project, cmdb.region as region, COALESCE(M.name1,0) as count_write_cnt,
    cast(json_extract(cmdb.quota, '$.write_cnt_per_min') as bigint) as quota_write_cnt from "resource.sls.cmdb.project" as cmdb
    LEFT JOIN (
         SELECT element_at(__labels__, 'project') as project,
      MAX(CASE WHEN __name__ = 'logstore_write_count' THEN __value__ ELSE NULL END) AS name1
      FROM "internal-monitor-metric.prom" where __name__ ='logstore_write_count' and regexp_like(element_at( split_to_map(__labels__, '|', '#$#') , 'project') , '.*') group by project) AS M ON cmdb.id = M.project ) order by write_cnt_ratio desc  limit 1000

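2. Configure the alert

Configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when the write count of a Project exceeds 90% of its quota.
  • Severity is Medium when the write count of a Project exceeds 80% of its quota.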

image

image

Overrun Monitoring

1. Confirm the alert SQL: every 15 minutes, check whether the Project write count quota has been exceeded.

image

Query SQL:

    * and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode:WriteQuotaExceed)| SELECT Project,
    COUNT(1) AS count where ErrorMsg like '%Project write quota exceed: qps%' GROUP BY Project ORDER BY count DESC LIMIT 1000

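2. Configure the alert

Configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when Project write count quota overrun errors occur 10 times.
  • Severity is Medium when a Project write count quota overrun error occurs 1 time.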

image

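Requesting a Resource Quota Adjustment

Procedure

  1. Log on to the Simple Log Service console.
  2. In the Project list, click the target Project.
  3. Click the image icon.
  4. Click Manage next to Resource Quota.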

image

  5. In the Resource Quota panel, adjust the quota of the target resource, and then click Save.

image