diff --git a/doc/source/internal/apimon_training/alerts.rst b/doc/source/internal/apimon_training/alerts.rst index d110994..789e5f9 100644 --- a/doc/source/internal/apimon_training/alerts.rst +++ b/doc/source/internal/apimon_training/alerts.rst @@ -4,7 +4,7 @@ Alerts Alerta is the component of the ApiMon that is designed to integrate alerts from multiple sources. It supports many different standard sources like Syslog, -SNMP, Prometheus, Nagios, Zabbix, etc. Additioanlly any other type of source +SNMP, Prometheus, Nagios, Zabbix, etc. Additionally, any other type of source using URL request or command line can be integrated as well. Native functions like correlation and de-duplication help to manage thousands of @@ -12,10 +12,10 @@ alerts in transparent way and consolidate alerts in proper categories based on environment, service, resource, failure type, etc. Alerta is hosted on https://alerts.eco.tsi-dev.otc-service.com/ . -The authentication is centrally managed by LDAP. +The authentication is centrally managed by OTC LDAP. The Zulip API was integrated with Alerta, to send notification of errors/alerts -on zulip stream. +on Zulip stream. Alerts displayed on OTC Alerta are generated either by Executor, Scheduler, EpMon or by Grafana. diff --git a/doc/source/internal/apimon_training/dashboards.rst b/doc/source/internal/apimon_training/dashboards.rst index 8b92c12..985c11f 100644 --- a/doc/source/internal/apimon_training/dashboards.rst +++ b/doc/source/internal/apimon_training/dashboards.rst @@ -4,7 +4,7 @@ Dashboards management https://dashboard.tsi-dev.otc-service.com -The authentication is centrally managed by LDAP. +The authentication is centrally managed by OTC LDAP. The ApiMon Dashboards are segregated based on the type of service: @@ -29,14 +29,14 @@ views can be adjusted based on chosen value. OTC KPI Dashboard ================= -OTC KPI dashobard was requested by management to provide SLA like views on +OTC KPI dashboard was requested by management to provide SLA-like views on services including: - Global SLI views (Service Level Indicators) of API availability, latency, API errors - Global SLO views (Service Leven Objectives) - Service based SLI views of availability, success rate, errors counts, latencies - - Custome service views for specific case like OS boot time duration, server - provisioning failues, volume backup duration, etc + - Custom service views for specific cases like OS boot time duration, server + provisioning failures, volume backup duration, etc. https://dashboard.tsi-dev.otc-service.com/d/APImonKPI/otc-kpi?orgId=1 @@ -46,11 +46,11 @@ of the specific service. .. image:: training_images/kpi_dashboard.png -24/7 Mission control dasbhoards +24/7 Mission control dashboards =============================== 24/7 Mission control squads uses CloudMon, ApiMon and EpMon metrics and present -them on their own customized dashboards which are fullfilling their +them on their own customized dashboards which are fulfilling their requirements. https://dashboard.tsi-dev.otc-service.com/d/eBQoZU0nk/overview?orgId=1&refresh=1m @@ -74,7 +74,7 @@ https://dashboard.tsi-dev.otc-service.com/d/APImonEPmon/endpoint-monitoring?orgI ApiMon Test Results Dashboard ============================= -This dasbhoards summarize the overall status of the ApiMon playbook scenarios +This dashboard summarizes the overall status of the ApiMon playbook scenarios for all services.
The scenarios are fetched in endless loop from github repository (:ref:`Test Scenarios `), executed and various metrics (:ref:`Metric Definitions `) are collected. @@ -86,27 +86,27 @@ On this dashboard users can immeditaly identify: - count of API errors - which scenarios are passing, failing, being skipped, - how long these test scenarios are running - - the list of failed scenarios with links to ansible playbook output.log + - the list of failed scenarios with links to Ansible playbook output.log Based on historical trends and annotations user can identify whether sudden change in the scenario behavior has been impacted by some planned change on -platform (JRIA annotations) or whether there's some new outage/bug. +platform (JIRA annotations) or whether there's some new outage/bug. .. image:: training_images/apimon_test_results.jpg Service Based Dashboard ======================= -The dashboad provides deeper insight in single service with tailored views, +The dashboard provides deeper insight in single service with tailored views, graphs and tables to address the service major functionalities abd specifics. https://dashboard.tsi-dev.otc-service.com/d/APImonCompute/compute-service-statistics?orgId=1 -For example in Compute Service Statistics such dasbhoard include: +For example, the Compute Service Statistics dashboard includes: - Success rate of ECS deployments across different availability zones - Instance boot duration for most common images - - SSH succesfull logins + - SSH successful logins - Metadata server latencies and query failures - API calls duration - Bad API calls @@ -125,24 +125,24 @@ Custom Dashboards ================= Previous dashboards are predefined and read-only. -THe further customization is currently possible via system-config in github: +Further customization is currently possible via system-config on GitHub: https://github.com/opentelekomcloud-infra/system-config/tree/main/playbooks/templates/grafana/apimon -The predefined dashboard jinja templates are stored there and can be customized +The predefined dashboard Jinja templates are stored there and can be customized in standard gitops way (fork and pull request) In future this process will be -replaced by simplified dashboard panel definition in stackmon github -repostiory(https://github.com/stackmon/apimon-tests/tree/main/dashboards) +replaced by a simplified dashboard panel definition in the Stackmon GitHub +repository (https://github.com/stackmon/apimon-tests/tree/main/dashboards) -Dasbhoards can be customized also just by copy/save function directly in +Dashboards can also be customized just by the copy/save function directly in Grafana. So in case of customization of Compute Service Statistics dashboard the whole dashboard can be saved under new name and then edited without any restrictions. This approach is valid for PoC, temporary solutions and investigations but -should not be used as permanent solution as customized dasbhoards which are not -properly stored on github repositories might be permanently deleted in case of -full daashboard service re-installation. +should not be used as a permanent solution, as customized dashboards which are not +properly stored on GitHub repositories might be permanently deleted in case of +full dashboard service re-installation.
diff --git a/doc/source/internal/apimon_training/databases.rst b/doc/source/internal/apimon_training/databases.rst index 5c8a0b1..ba11015 100644 --- a/doc/source/internal/apimon_training/databases.rst +++ b/doc/source/internal/apimon_training/databases.rst @@ -75,7 +75,7 @@ OpenStack metrics branch is structured as following: - request method (GET/POST/DELETE/PUT) - - resource (service resource, i.e. server, keypair, volume, etc). Subresources are joined with "_" (i.e. cluster_nodes) + - resource (service resource, e.g. server, keypair, volume, etc.). Sub-resources are joined with "_" (e.g. cluster_nodes) - response code - received response code diff --git a/doc/source/internal/apimon_training/difference_cmo_fmo.rst b/doc/source/internal/apimon_training/difference_cmo_fmo.rst index bfa691e..c33b734 100644 --- a/doc/source/internal/apimon_training/difference_cmo_fmo.rst +++ b/doc/source/internal/apimon_training/difference_cmo_fmo.rst @@ -8,6 +8,9 @@ Due to the ongoing transformation of ApiMon and integration to a more robust CloudMon there are two operation modes right now. Therefore it's important to understand what is supported in which mode. +This page aims to provide navigation links and to help understand the changes once the +transformation is completed and some of the locations change. + The most important differences are described in the table below: +-----------------------+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+ diff --git a/doc/source/internal/apimon_training/epmon_checks.rst b/doc/source/internal/apimon_training/epmon_checks.rst index a62aeeb..8dbb0d8 100644 --- a/doc/source/internal/apimon_training/epmon_checks.rst +++ b/doc/source/internal/apimon_training/epmon_checks.rst @@ -5,7 +5,7 @@ Endpoint Monitoring overview ============================ -EpMon is a standalone python based process targetting every OTC service. Tt +EpMon is a standalone Python-based process targeting every OTC service. It finds service in the service catalogs and sends GET requests to the configured endpoints. @@ -14,8 +14,8 @@ coverage, but is usually not something what can be performed very often and leaves certain gaps on the timescale of monitoring. In order to cover this gap EpMon component is capable to send GET requests to the given URLs relying on the API discovery of the OpenStack cloud (perform GET request to /servers or the -compute endpoint). Such requests are cheap and can be performed in the loop i.e. -every 5 seconds. Latency of those calls, as well as the return codes are being +compute endpoint). Such requests are cheap and can be performed in a loop, e.g. +every 5 seconds. Latency of those calls, as well as the return codes, are being captured and sent to the metrics storage. diff --git a/doc/source/internal/apimon_training/faq/how_to_read_the_logs_and_understand_the_issue.rst b/doc/source/internal/apimon_training/faq/how_to_read_the_logs_and_understand_the_issue.rst index c709f90..95af642 100644 --- a/doc/source/internal/apimon_training/faq/how_to_read_the_logs_and_understand_the_issue.rst +++ b/doc/source/internal/apimon_training/faq/how_to_read_the_logs_and_understand_the_issue.rst @@ -1,3 +1,5 @@ +.. _working_with_logs: + ============================================= How To Read The Logs And Understand The Issue ============================================= @@ -24,7 +26,7 @@ accessed from multiple locations: ..
image:: faq_images/dashboard_log_links.jpg -The logs contain whole ansible playbook output and help to analyse the problem +The logs contain the whole Ansible playbook output and help to analyze the problem in detail. For example following log detail describes the failed scenario for ECS deployment:: diff --git a/doc/source/internal/apimon_training/faq/what_are_the_annotations.rst b/doc/source/internal/apimon_training/faq/what_are_the_annotations.rst index 82de332..be59af8 100644 --- a/doc/source/internal/apimon_training/faq/what_are_the_annotations.rst +++ b/doc/source/internal/apimon_training/faq/what_are_the_annotations.rst @@ -13,7 +13,7 @@ field can include links to other systems with more detail. In Cloudmon Dashboards annotations are used to show the JIRA change issue types which change the transition from SCHEDULED to IN EXECUTION. This helps to identify if some JIRA change has negative impact on platform in real time. The -annotations contain several fields which help to corelate the platform behaviour +annotations contain several fields which help to correlate the platform behavior with the respective change directly on the dashboard: - JIRA Change issue ID diff --git a/doc/source/internal/apimon_training/introduction.rst b/doc/source/internal/apimon_training/introduction.rst index 81f437b..67e5e36 100644 --- a/doc/source/internal/apimon_training/introduction.rst +++ b/doc/source/internal/apimon_training/introduction.rst @@ -34,8 +34,8 @@ ApiMon Architecture Summary `Github `_. - EpMon executes various HTTP query requests towards service endpoints and - generates statistsic - - Scheduler fetches the latest playbooks from repo and puts them in + generates statistics + - Scheduler fetches the latest playbooks from repo and puts them in a queue to run in a endless loop. - Executor is running the playbooks from queue and capturing the metrics - The ansible playbook results generates the metrics (duration, result). @@ -69,8 +69,8 @@ ApiMon comes with the following features: - internal (OTC) - external (vCloud) -- Alerts agregated in Alerta and notifications sent to zulip -- Various dasbhoards +- Alerts aggregated in Alerta and notifications sent to Zulip +- Various dashboards - KPI dashboards - 24/7 squad dashboards @@ -102,7 +102,7 @@ possible): - No synthetic workloads: The service is not simulating any workloads (for example a benchmark suite) on the provisioned resources. Instead it measures and reports only if APIs are available and return expected results with an - expected behaviour. + expected behavior. - No every single API monitoring .The API-Monitoring focuses on basic API functionality of selected components. It doesn't cover every single API call available in OTC API product portfolio. diff --git a/doc/source/internal/apimon_training/logs.rst b/doc/source/internal/apimon_training/logs.rst index 25b6d68..6454274 100644 --- a/doc/source/internal/apimon_training/logs.rst +++ b/doc/source/internal/apimon_training/logs.rst @@ -9,7 +9,7 @@ Logs - Every single job run log is stored on OpenStack Swift object storage.
- Each single job log file provides unique URL which can be accessed to see log details -- These URLs are available on all APIMON levels: +- These URLs are available on all ApiMon levels: - In Zulip alarm messages - In Alerta events @@ -38,3 +38,9 @@ Logs 2020-07-12 05:54:48.505906 | TASK [Delete SecurityGroup] 2020-07-12 05:54:50.727174 | localhost | changed 2020-07-12 05:54:50.745541 | + + +For further details on how to work with logs please refer to the :ref:`How To Read The +Logs And Understand The Issue + ` FAQ page. + diff --git a/doc/source/internal/apimon_training/metrics.rst b/doc/source/internal/apimon_training/metrics.rst index 244330f..ebf7e7c 100644 --- a/doc/source/internal/apimon_training/metrics.rst +++ b/doc/source/internal/apimon_training/metrics.rst @@ -4,7 +4,7 @@ Metrics ======= -The ansible playbook scenarios generate metrics in two ways: +The Ansible playbook scenarios generate metrics in two ways: - The Ansible playbook internally invokes method calls to **OpenStack SDK libraries.** They in turn generate metrics about each API call they do. This @@ -41,16 +41,17 @@ The ansible playbook scenarios generate metrics in two ways: Custom metrics: In some situations more complex metric generation is required which consists of -execution of multiple tasks in scenario. For such cases the tags parameter is +execution of multiple tasks in a scenario. For such cases, the tags parameter is used. Once the specific tasks in playbook are tagged with some specific metric name the metrics are calculated as sum of all executed tasks with respective -tag. It's useful in cases where measured metric contains multiple steps to -achieve the desired state of service or service resource. For example boot up of -virtual machine from deployment until succesfull login via SSH. +tag. It's useful in cases where the measured metric contains multiple steps to +achieve the desired state of a service or service resource. For example, the boot up of +a virtual machine from deployment until successful login via SSH. .. code-block:: tags: ["metric=delete_server"] tags: ["az={{ availability_zone }}", "service=compute", "metric=create_server{{ metric_suffix }}"] -More details how to query metrics from databases are described on :ref:`Metric databases ` page. +More details on how to query metrics from the databases are described on the :ref:`Metric +databases ` page. diff --git a/doc/source/internal/apimon_training/notifications.rst b/doc/source/internal/apimon_training/notifications.rst index 541d191..0c0262e 100644 --- a/doc/source/internal/apimon_training/notifications.rst +++ b/doc/source/internal/apimon_training/notifications.rst @@ -2,15 +2,15 @@ Notifications ============= -Zulip as officialt OTC communication channels supports API interface for pushing -the notifications from ApiMon to various zulip streams: +Zulip, as the official OTC communication channel, supports an API interface for pushing +the notifications from ApiMon to various Zulip streams: - #Alerts Stream - #Alerts-Hybrid Stream - #Alerts-Preprod Stream Every stream contains topics based on the service type (if represented by -standalone ansible playbook) and general apimon_endpoint_monitor topic whihc +standalone Ansible playbook) and the general apimon_endpoint_monitor topic which contains alerts of GET queries towards all services.
If the error has been acknowledged on Alerta, the new notification message for diff --git a/doc/source/internal/apimon_training/test_scenarios.rst b/doc/source/internal/apimon_training/test_scenarios.rst index fea1a47..660543c 100644 --- a/doc/source/internal/apimon_training/test_scenarios.rst +++ b/doc/source/internal/apimon_training/test_scenarios.rst @@ -12,41 +12,42 @@ python script). With Ansible on it's own having nearly limitless capability and availability to execute anything else ApiMon can do pretty much anything. The only expectation is that whatever is being done produces some form of metric for further analysis and evaluation. Otherwise there is no sense in monitoring. The -scenarios are collected in a Git repository and updated in real-time. In general -mentioned test jobs do not need take care of generating data implicitly. Since -the API related tasks in the playbooks rely on the Python OpenStack SDK (and its -OTC extensions), metric data generated automatically by a logging interface of -the SDK ('openstack_api' metrics). Those metrics are collected by statsd and -stored to :ref:`graphite TSDB `. +scenarios are collected in a `Git repository +`_ and updated in +real-time. In general the mentioned test jobs do not need to take care of generating +data explicitly. Since the API related tasks in the playbooks rely on the Python +OpenStack SDK (and its OTC extensions), metric data is generated automatically by a +logging interface of the SDK ('openstack_api' metrics). Those metrics are +collected by statsd and stored to :ref:`graphite TSDB `. -Additionall metric data are generated also by executor service which collects +Additionally, metric data is also generated by the executor service which collects the playbook names, results and duration time ('ansible_stats' metrics) and stores them to :ref:`postgresql relational database `. -The playbooks with monitoring scenarios are stored in separete repository on -`github `_ (the location -will change with CloudMon replacement in future). Playbooks address the most -common use cases with cloud services conducted by end customers. +The playbooks with monitoring scenarios are stored in a separate repository on +`github `_ (the location +will change with the CloudMon replacement in `future +`_). Playbooks address the most common use cases +with cloud services conducted by end customers. The metrics generated by Executor are described on :ref:`Metric Definitions ` page. In addition to metrics generated and captured by a playbook ApiMon also captures -:ref:`stdout of the execution `. and saves this log for additional analysis to OpenStack -Swift storage where logs are being uploaded there with a configurable retention -policy. +:ref:`stdout of the execution ` and saves this log for additional +analysis to OpenStack Swift storage, where logs are being uploaded with a +configurable retention policy. New Test Scenario introduction ============================== - -As already mentioned playbook scenarios are stored in separete repository on +As already mentioned, playbook scenarios are stored in a separate repository on `github `_. Due to the -fact that we have farious environments which differ between each other by +fact that we have various environments which differ from each other by location, supported services, different flavors, etc it's required to have monitoring configuration matrix which defines the monitoring standard and scope -for each enviroment. Therefore to enable +for each environment.
Therefore to enable playbook in some of the monitored environments (PROD EU-DE, EU-NL, PREPROD, Swisscloud) further update is required in the `monitoring matrix `_. @@ -57,7 +58,7 @@ Rules for Test Scenarios ======================== Ansible playbooks need to follow some basic regression testing principles to -ensure sustainability of the endless exceution of such scenarios: +ensure sustainability of the endless execution of such scenarios: - **OpenTelekomCloud and OpenStack collection** @@ -83,8 +84,8 @@ ensure sustainability of the endless exceution of such scenarios: - **Simplicity** - - Do not overcomplicate test scenario. Use default auto-autofilled parameters - whereever possible + - Do not over-complicate the test scenario. Use default auto-filled parameters + wherever possible - **Only basic / core functions in scope of testing** @@ -93,7 +94,7 @@ ensure sustainability of the endless exceution of such scenarios: - Focus only on core functions which are critical for basic operation / lifecycle of the service. - The less functions you use the less potential failure rate you will have on - runnign scenario for whatever reasons + a running scenario for whatever reason - **No hardcoding** @@ -110,20 +111,23 @@ ensure sustainability of the endless exceution of such scenarios: Custom metrics in Test Scenarios ================================ -OpenStack SDK and otcextensions support metric generation natively for every -single API call and ApiMon executor supports collection of ansible playbook -statistics so every single scenario and task can store its result, duration and -name in metric database. + +OpenStack SDK and otcextensions (otcextensions covers services which are out of +scope of the OpenStack SDK and extends its functionality with services provided by +OTC) support metric generation natively for every single API call, and the ApiMon +executor supports collection of Ansible playbook statistics, so every single +scenario and task can store its result, duration and name in the metric database. But in some cases there's a need to provide measurement on multiple tasks which represent some important aspect of the customer use case. For example measure -the time and overall result from the VM deployment until succesfull login via +the time and overall result from the VM deployment until successful login via SSH. Single task results are stored as metrics in metric database but it would be complicated to transfer processing logic of metrics on grafana. Therefore tags feature on task level introduces possibility to address custom metrics. -In following example the custom metric stores the result of multiple tasks in special metri name create_server:: +In the following example the custom metric stores the result of multiple tasks under the +special metric name create_server:: - name: Create Server in default AZ openstack.cloud.server:
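A minimal sketch of this multi-task tagging pattern is shown below for illustration only: the module arguments, the variable names and the SSH wait step are assumptions rather than the actual content of the apimon-tests playbooks; only the tag layout follows the convention documented above.

.. code-block:: yaml

   # Both tasks carry the same "metric=create_server..." tag, so the executor
   # aggregates their results and durations into a single custom metric that
   # spans the whole flow from server creation until SSH is reachable.
   - name: Create Server in default AZ
     openstack.cloud.server:
       name: "{{ server_name }}"        # illustrative variable names
       image: "{{ image_name }}"
       flavor: "{{ flavor_name }}"
       key_name: "{{ keypair_name }}"
       network: "{{ network_name }}"
       auto_ip: true
     register: server
     tags: ["az={{ availability_zone }}", "service=compute", "metric=create_server{{ metric_suffix }}"]

   - name: Wait for SSH to become reachable
     ansible.builtin.wait_for:
       # the address would come from the registered result above; the exact
       # attribute depends on the openstack.cloud collection version
       host: "{{ server.server.public_v4 }}"
       port: 22
       timeout: 600
     tags: ["az={{ availability_zone }}", "service=compute", "metric=create_server{{ metric_suffix }}"]

With such a layout the executor would report one create_server duration and result covering both steps, in addition to the per-call 'openstack_api' metrics that the OpenStack SDK emits automatically.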