ObservAI Support
What is ObservAI?
ObservAI is an AI-powered observability and incident response platform that works across clouds and on-premises. It helps teams detect, diagnose, and resolve production issues faster.
The platform ingests logs and telemetry from your workloads via Connectors (e.g. GCP, AWS, Azure) or OpenTelemetry (OTLP), uses machine learning and AI to detect anomalies, performs root cause analysis, and suggests or executes remediation—with configurable human oversight.
Value: Reduce mean time to resolution (MTTR), improve reliability, and maintain a full audit trail of incidents and automated actions. Vendor-agnostic: use the connector that matches your environment.
How it works
Data ingestion — Connect your environment using the appropriate Connector (GCP, AWS, Azure, or OpenTelemetry). Logs flow via the connector's native pipeline (e.g. cloud logging → message queue → ObservAI) or via OTLP. The connector's primary log path runs full anomaly detection; OTLP logs are stored for querying and analysis.
Anomaly detection — Incoming logs are evaluated by an ML ensemble and optional AI validation. Only events that meet the anomaly threshold are escalated, reducing noise.
Incident management — Escalated events are deduplicated and grouped into incidents so that many similar alerts become a single incident. This prevents alert storms and keeps focus on root cause.
Traceability and correlation — The system correlates each anomaly with related logs and service topology (including OpenTelemetry data) to identify the root component and propagation path.
Remediation and execution — AI suggests concrete actions (e.g. restart, scale, rollback). Actions can be auto-executed, sent for approval, or shown as suggestions—depending on your policy and confidence settings.
Root cause analysis — For each incident, the platform produces a structured root cause summary with category, evidence, and impact.
Notifications — You receive one consolidated email per incident, with anomaly details, impact, remediation plan, RCA, and execution status. Approval requests are sent when human approval is required.
Human-in-the-loop — When configured, actions that need approval appear in the Approvals page and by email. You can approve, reject, or let the request expire. After execution, the system can verify outcomes and trigger rollback if needed.
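The pipeline steps above can be sketched as a small sequential flow. This is an illustrative stand-in only: all names, the "ERROR means anomalous" rule, and grouping by trace ID are assumptions for the example, not ObservAI's actual API or detection logic.

```python
# Illustrative sketch of the incident pipeline described above.
# All names and rules here are hypothetical, not ObservAI's API.
from dataclasses import dataclass, field

@dataclass
class Incident:
    trace_id: str
    events: list = field(default_factory=list)
    stages: list = field(default_factory=list)  # pipeline stages run

incidents = {}  # trace_id -> Incident

def handle_log(event):
    # 1. Anomaly detection: only high-severity events escalate (stand-in rule)
    if event.get("severity") != "ERROR":
        return None
    # 2. Grouping: events sharing a trace ID join one incident
    inc = incidents.setdefault(event["trace_id"], Incident(event["trace_id"]))
    inc.events.append(event)
    # 3. Only the primary (first) event triggers the full pipeline;
    #    secondary duplicates are attached but do not re-trigger it
    if len(inc.events) == 1:
        inc.stages += ["traceability", "remediation", "execution", "rca", "notify"]
    return inc
```

Note how a burst of similar errors produces one incident whose later events skip the heavy stages entirely, which is why you see a single consolidated email rather than an alert storm.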
For more detail on each part of the pipeline, see Capabilities and components.
Capabilities and components
Anomaly detection
Evaluates each log line using an ML ensemble and optional AI validation. Supports ingestion from Connectors (provider-specific log pipelines) and OpenTelemetry (OTLP). Full ML-based detection runs on the connector's primary log path; OTLP logs are stored for querying. Configure ingestion via the Connector for your environment; the component runs automatically once data is connected.
Orchestration
Central coordinator. Receives detected anomalies, deduplicates and suppresses bursts, and groups events into incidents. Runs traceability, remediation, execution, RCA, and notification in sequence. Processing is asynchronous so ingestion stays responsive.
Traceability
Correlates the anomaly with related logs (e.g. by trace ID, dependency graph, time). Uses OpenTelemetry-derived topology to identify the root component and failure propagation path. Runs first so remediation and RCA have full context. Results appear in the Audit view and in incident emails.
Remediation
Uses AI to produce an actionable remediation plan with structured suggestions (action type, target component, urgency, safety). These drive the Executor. Runs for the primary event of each incident. You see remediation steps in the Audit tab and in notification emails.
Executor
Maps remediation suggestions to your pre-approved actions (defined in Policies). Applies confidence-based behavior: high confidence can auto-execute, medium can require approval, low can be suggestion-only. Manages approval requests (email and Approvals page), execution, verification, and optional rollback. You interact via Policies (define and tune actions) and Approvals (approve or reject).
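The confidence-based dispatch described above can be sketched as a simple threshold function. The threshold values and tier names here are placeholders; the actual values are what you configure per action in Policies.

```python
# Sketch of confidence-tier dispatch as described above.
# Thresholds are illustrative defaults, not ObservAI's.
def execution_mode(confidence, auto_threshold=0.9, approval_threshold=0.6):
    if confidence >= auto_threshold:
        return "auto-execute"
    if confidence >= approval_threshold:
        return "require-approval"
    return "suggest-only"
```

Tuning the two thresholds per action lets you keep risky actions (e.g. rollback in production) in the approval or suggestion band while letting safe ones (e.g. restart in staging) auto-execute.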
Root cause analysis (RCA)
Produces a structured root cause for each incident: category, evidence, contributing factors, and impact. Runs for primary events. Results appear in the Audit view and in incident emails.
Notifications (Collaborator)
Sends one consolidated HTML email per incident with anomaly details, impact, remediation plan, RCA, and execution status. Also sends approval-request and execution-result emails. Throttling ensures you do not receive duplicate emails for the same incident.
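Per-incident throttling of this kind is commonly implemented as a last-sent timestamp per incident. The sketch below is an assumption about the mechanism, not the platform's implementation; the one-hour window is a placeholder.

```python
# Sketch of per-incident email throttling: at most one email per
# incident per window. Window length is illustrative.
import time

_last_sent = {}  # incident_id -> timestamp of last email

def should_send(incident_id, window_s=3600.0, now=None):
    now = time.time() if now is None else now
    last = _last_sent.get(incident_id)
    if last is not None and now - last < window_s:
        return False  # suppressed: already emailed within the window
    _last_sent[incident_id] = now
    return True
```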
Analyzer
Natural language query over your observability data. Use the Analyzer in the UI to ask questions in plain language and receive streaming results with optional AI follow-up. Independent of the incident pipeline; useful for ad-hoc investigation and reporting.
Recommender
Proactive code analysis for security, performance, reliability, and maintainability. Integrates with GitHub or GitLab. Use it for pre-production quality; separate from the real-time incident flow.
Connectors
Configure how data reaches ObservAI. Connectors are available for GCP, AWS, Azure, and other environments. Each connector provides ingestion pipelines (provider-native and/or OTLP), a dashboard for config and pipeline controls where applicable (e.g. start/stop, collector status, log samples), and optional metrics. See Connectors: data ingestion and configuration below.
Key concepts
Incident
One or more related anomaly events grouped by the platform (e.g. by trace ID or message similarity). Has status, severity, and one primary event that drives full analysis and remediation.
Anomaly
A single log or event flagged by the anomaly detection component. Many anomalies can map to one incident.
Primary vs secondary events
The primary event triggers the full pipeline (traceability, remediation, execution, RCA, notifications). Duplicates and bursts are grouped as secondary and do not re-trigger the pipeline.
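Grouping "by trace ID or message similarity" usually means computing a stable key per event. The helper below is a hypothetical illustration of one such key (strip volatile numbers from the message), not the platform's actual grouping algorithm.

```python
# Illustrative grouping key: events sharing a trace ID, or with
# structurally similar messages (digits/hex stripped), share an
# incident. Hypothetical helper, not ObservAI's algorithm.
import re

def incident_key(event):
    if event.get("trace_id"):
        return "trace:" + event["trace_id"]
    # Normalize volatile tokens so "timeout after 31ms" and
    # "timeout after 250ms" group into the same incident.
    msg = re.sub(r"0x[0-9a-fA-F]+|\d+", "<n>", event.get("message", ""))
    return "msg:" + msg
```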
Approval request
When the Executor's confidence falls in the approval band, it creates an approval request. The request appears on the Approvals page and in an email; you can approve it, reject it, or let it expire on timeout.
Confidence tiers
The Executor applies one of three tiers: auto-execute (high confidence), approval required (medium), or suggestion only (low). Thresholds are configurable per action.
Pre-approved actions
Defined in Policies. Each action specifies when it applies (error category, component pattern, severity) and its safety constraints. The Executor only runs actions that match the incident and pass safety checks.
Using the platform
Dashboard
Component health, workload/cluster metrics, and analytics. Use it to confirm platform health and that ingestion is flowing.
Analyzer
Type a question in natural language; the system queries your observability data and streams results. Use for ad-hoc exploration of logs and anomalies.
Audit
Incident-centric view: anomaly, RCA, remediation, executor, timeline. Use for post-incident review and compliance.
Approvals
Lists pending actions that need your decision, each with a countdown timer. Approve or reject (with an optional reason) before the timeout if you want to allow execution.
Policies
Define and edit pre-approved actions and matching rules. Required for any automated or approval-based execution.
Connectors / Settings
Configure ingestion per connector (GCP, AWS, Azure, OTLP, etc.) and platform connections.
Getting started and ingestion
Connector-based ingestion
Use the Connector that matches your environment (GCP, AWS, Azure, etc.). Each connector wires your provider's logging (and optionally metrics/traces) to ObservAI. The connector's primary log path runs full anomaly detection. Follow the in-app Connectors guide and the connector's setup steps (e.g. create sink/topic/queue, set push endpoint, configure auth).
OpenTelemetry (OTLP)
You can also send logs, metrics, and traces from any OTLP collector or instrumentation. Logs are stored for querying and analysis; metrics enrich incident context. Configure the OTLP endpoint and authentication (e.g. Bearer token or API key) in your deployment or connector.
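An OTLP/HTTP log export is an authenticated POST of a `resourceLogs` payload to the endpoint's `/v1/logs` path. The sketch below builds such a request without sending it; the endpoint, token, and service name are placeholders you must replace with your deployment's values.

```python
# Sketch of an OTLP/HTTP log export request (no network call here;
# endpoint, token, and attribute values are placeholders).
import json

def build_otlp_log_request(endpoint_base, token, body_text):
    payload = {
        "resourceLogs": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": "checkout"}}]},
            "scopeLogs": [{"logRecords": [
                {"severityText": "ERROR",
                 "body": {"stringValue": body_text}}]}],
        }]
    }
    return {
        # /v1/logs is the standard OTLP/HTTP logs path
        "url": endpoint_base.rstrip("/") + "/v1/logs",
        "headers": {"Authorization": "Bearer " + token,
                    "Content-Type": "application/json"},
        "data": json.dumps(payload),
    }
```

In practice you would let an OTel Collector or SDK exporter build these requests; the sketch only shows what the endpoint and auth configuration amount to on the wire.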
What to configure
Depends on the connector: typically provider project/account, region, topic/queue/endpoint, and ObservAI API base URL and API key. Use consistent component or service naming so traceability and action matching work as expected.
Connectors: data ingestion and configuration
ObservAI is vendor-agnostic. Connectors are available for GCP, AWS, Azure, and other environments. Each connector streams observability data into the platform and may provide a dashboard for config and pipeline controls.
Typical pipeline patterns
- Provider-native log pipeline — Workloads write to stdout or the provider's logging service. The connector (or your setup) routes those logs to ObservAI (e.g. via a message queue or push subscription). This path runs full anomaly detection. Authentication is provider-specific (e.g. OIDC, IAM, API keys).
- OpenTelemetry (OTLP) — An OTel Collector or instrumented apps send logs, metrics, and traces to ObservAI via OTLP HTTP. Logs are stored for querying (e.g. Analyzer); metrics enrich incident context. Authentication is typically Bearer token or API key.
Configuration
- Provider: Project/account ID, region, and any provider-specific resources (e.g. topic, subscription, queue).
- ObservAI: API base URL and API key (or equivalent) for the connector and/or OTel exporter.
- Optional: Cluster/namespace or resource labels for context; data store for log sampling in the connector dashboard (where supported).
Configure via the connector's dashboard (Setup Wizard and Configuration panel, where available) or via environment variables / manifests used by the connector and OTel Collector.
Pipeline controls (where supported)
Some connectors expose Start / Stop (enable or disable delivery to ObservAI), Restart (clear backlog and start), and Clear queue. Use the connector dashboard to check pipeline status and control delivery.
OTel Collector (where used)
If the connector or your setup uses an OTel Collector (e.g. DaemonSet), you can check status, Restart, and Enable/Stop from the connector UI or cluster. Collector config references the ObservAI endpoint and auth (env or secret).
Log sampling and metrics (where supported)
Connector dashboards may offer log samples (preview from your logs store), pipeline metrics (throughput, errors), and optional workload metrics (e.g. container CPU/memory) synced to ObservAI via OTLP.
Where to configure
- First-time: Use the connector's Setup Wizard (Configure → Test → Complete): provider credentials, region, topic/queue/endpoint, ObservAI endpoint, API key. Save and test connection.
- Current values: The connector's Configuration panel (or equivalent) shows current settings. Infrastructure (sinks, topics, IAM, collector manifests) is created via the connector's scripts or your provider's console, as documented per connector.
Summary for support
- Anomaly detection runs on the connector's primary log pipeline. OTLP logs are stored and queryable; they do not go through ML detection unless the connector sends them over that primary path.
- Authentication: Depends on the connector (e.g. OIDC, IAM, Bearer). Ensure audience/endpoint and API key are correct where the connector or OTel calls ObservAI.
- No data: Verify the connector's pipeline is started (push/stream enabled), provider resources (sink, topic, queue) match the connector config, and (if used) OTel Collector is running with correct endpoint and auth. See the connector's documentation.
Troubleshooting
No anomalies detected
Confirm logs are reaching the platform via your Connector or OTLP. Note that low-severity entries (e.g. INFO, DEBUG) are not evaluated for anomalies. Check the Dashboard to ensure the anomaly detection component is healthy. Anomaly detection runs on the connector's primary log path; OTLP-only logs are stored for the Analyzer.
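A quick way to reason about "no anomalies" is to check whether the entries you expect are even eligible for evaluation. The severity set below is an illustrative assumption; check your deployment's configuration for the actual cutoff.

```python
# Self-check: low-severity entries (INFO, DEBUG, etc.) are not
# evaluated for anomalies. Severity set is illustrative.
EVALUATED = {"WARNING", "ERROR", "CRITICAL", "ALERT", "EMERGENCY"}

def eligible_for_detection(severity):
    return severity.upper() in EVALUATED
```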
Many similar events, one incident
The platform groups related events into a single incident to reduce noise. Only the primary event triggers the full pipeline. This is expected; see Key concepts.
No remediation or execution
Remediation runs for the primary event of each incident. Automated or approval-based execution requires a matching pre-approved action in Policies. Check that your action rules (error category, component pattern, severity) align with the incident, and review the confidence tier (suggestion vs approval vs auto-execute).
Approval email not received
Check the Approvals page in the UI. Verify email (SMTP) configuration and notification settings. Approval requests expire after the configured timeout if not answered.
Action did not execute
Verify that a pre-approved action in Policies matches the incident (component, error category, severity). Check safety constraints such as environment, time windows, and rate limits, and confirm the execution target is correctly registered.
Analyzer query fails or times out
Ensure your observability data store and permissions are correct. Large or complex queries may hit the query timeout or result limit; try a narrower time range or more specific filters.
Component or service unhealthy
The Dashboard shows the status of platform components. Check connectivity to your data store, database, and AI services as applicable for your deployment.
Connector: no logs or pipeline not running
In the connector dashboard, confirm the pipeline is started (push/stream enabled). Verify provider resources (e.g. sink, topic, queue) match the connector config and that the connector's auth (e.g. OIDC, API key) is set correctly. For OTLP: check that the OTel Collector is running and the exporter endpoint and Bearer token are correct. See Connectors: data ingestion and configuration.
FAQ
Why did I receive one email for many similar errors?
The platform groups related events into a single incident. You receive one consolidated email per incident.
Why was no action executed?
Execution requires a matching pre-approved action in Policies. If the match confidence is in the suggestion band, the platform will only suggest; if safety checks (environment, rate limit, time window) fail, execution is blocked.
Can I change when actions auto-execute vs require approval?
Yes. In Policies, you can configure confidence thresholds per action (auto-execute, approval required, or suggestion only).
Do OpenTelemetry (OTLP) logs run through anomaly detection?
Not by default. Full ML-based anomaly detection runs on the connector's primary log pipeline (e.g. the provider logging path); OTLP logs are stored and can be queried via the Analyzer.
How do I add a new automated action?
In Policies, create a pre-approved action and define matching rules (error category, component pattern, severity, etc.) and the command template and safety constraints. The platform will use it when remediation suggestions match.
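A pre-approved action plus matching rules might look like the sketch below. The field names, the glob-style component pattern, and the command template are assumptions for illustration, not the platform's actual schema.

```python
# Hypothetical pre-approved action definition and matcher; field
# names and matching semantics are illustrative, not ObservAI's schema.
from fnmatch import fnmatch

ACTION = {
    "name": "restart-payment-pods",
    "match": {"error_category": "crash_loop",
              "component_pattern": "payment-*",
              "min_severity": "ERROR"},
    "command_template": "kubectl rollout restart deploy/{component}",
    "safety": {"max_per_hour": 3, "environments": ["staging", "prod"]},
}

SEVERITY_RANK = {"WARNING": 1, "ERROR": 2, "CRITICAL": 3}

def action_matches(action, incident):
    m = action["match"]
    return (incident["error_category"] == m["error_category"]
            and fnmatch(incident["component"], m["component_pattern"])
            and SEVERITY_RANK[incident["severity"]]
                >= SEVERITY_RANK[m["min_severity"]])
```

Safety constraints (environment allow-list, rate limits, time windows) are checked separately at execution time, which is why a matching action can still be blocked.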
How do I set up data ingestion?
Use the Connector for your environment (GCP, AWS, Azure, etc.): follow the connector's setup guide to create the required provider resources (e.g. sink, topic, queue) and auth; configure the connector (project/account, region, endpoint, ObservAI URL, API key) in the Setup Wizard; start the pipeline. Optionally deploy or enable the OTel Collector for metrics and OTLP logs. See Connectors: data ingestion and configuration for details.
Support and documentation
Documentation
For architecture details, component behavior, and design decisions, refer to the full technical documentation provided with the solution or in the deployment package.
Marketplace or deployment channel
For billing, subscription, and channel-specific issues, use the support channel associated with your listing (e.g. Google Cloud Marketplace, AWS Marketplace, or your deployment channel).
Vendor support
Still have questions? Email support@observai.ai for issues or topics not covered in this guide or the documentation.