Degraded Performance

Incident Report for Mobaro

Postmortem

Database Service Outage Postmortem

Incident Duration: April 8, 2025, 07:00 UTC - 08:00 UTC (1 hour)

Summary

On April 8, 2025, we experienced a service outage that lasted for one hour, from 07:00 UTC to 08:00 UTC. The incident was triggered by the deployment of an update to our database indexing processes, which malfunctioned on one of our nodes; our attempt to halt the deployment then left the entire cluster unresponsive.

Timeline

  • 07:00 UTC: Deployment of a small update to our database indexing processes began.
  • Shortly after deployment: Team identified a malfunction in the update process affecting one of our nodes.
  • Following detection: Attempted to halt the deployment, which unfortunately caused the entire cluster to become unresponsive.
  • Following unresponsiveness: Initiated a controlled, node-by-node restart of the database cluster.
  • 08:00 UTC: Service fully restored after completing the restart process.

Impact

The database cluster was unresponsive during the incident window. While the outage lasted longer than anticipated due to the extended restart process, we can confirm that all data remained intact and secure throughout the incident.

Root Cause

A malfunction occurred during the deployment of a database indexing process update. When we attempted to stop the deployment to address this malfunction, it unexpectedly caused the entire cluster to become unresponsive, necessitating a full restart.

Resolution

We performed a safe, methodical node-by-node restart of the entire database cluster. While this process took longer than initially expected, this careful approach ensured data integrity was maintained throughout the recovery.
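For readers unfamiliar with the approach, a node-by-node (rolling) restart brings each database node down and back up in turn, waiting for it to report healthy before moving on, so the cluster as a whole stays available during recovery. The sketch below illustrates the general pattern only; the node names, restart command, and health check are illustrative assumptions, not Mobaro's actual tooling.

```shell
set -eu

# Assumed node names for illustration.
NODES="db-node-1 db-node-2 db-node-3"

for node in $NODES; do
  echo "Restarting $node"
  # In a real cluster you would restart the database service on the node,
  # e.g. (hypothetical command):
  #   ssh "$node" 'systemctl restart database'
  # ...and then block until the node reports healthy before continuing,
  # so the cluster never loses more than one node at a time:
  #   until ssh "$node" 'db-healthcheck'; do sleep 5; done
done
echo "Rolling restart complete"
```

Restarting one node at a time is slower than restarting the whole cluster at once, which is why recovery took longer than expected, but it avoids losing quorum and protects data integrity.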

Next Steps

Our team is currently:

  1. Investigating the specific cause of the initial update malfunction.
  2. Reviewing our deployment rollback procedures to prevent similar cluster-wide impacts.
  3. Evaluating ways to optimize our restart processes to reduce recovery time.
  4. Increasing the resources available to each database node.

We apologize for any inconvenience this outage may have caused. We remain committed to improving our systems to prevent similar incidents in the future.

Posted Apr 08, 2025 - 10:42 CEST

Resolved

This incident has been resolved.
Posted Apr 08, 2025 - 10:17 CEST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Apr 08, 2025 - 09:57 CEST

Update

We are continuing to investigate this issue.

As a possible workaround, setting your mobile device to "Airplane mode" will allow the app to respond a bit faster during this incident.
Posted Apr 08, 2025 - 09:36 CEST

Investigating

We are currently investigating this issue.
Posted Apr 08, 2025 - 09:17 CEST
This incident affected: Mobile App, Backend Application, RideOps, Public API, and Single Sign-On.