How We Built the Most Reliable Data Pipeline Ever

From five-engineer startup to 200-engineer unicorn, we’ve remained focused on one goal: making business data as accessible as electricity.

Only a few years ago, 新巴黎人官方 was a five-engineer data integration startup. Now we’re a 200-engineer unicorn.

It’s taken a lot of hard work to get to this point. Building a data pipeline requires just three straightforward steps — extract data, process and normalize it, and write it to a destination — but it took us many years to master those steps.

Here’s the central lesson we’ve learned: You can’t build a data replication solution once and expect it to work reliably forever, because source systems are too complex. APIs break or behave unexpectedly, and there are so many edge cases that only time, and a lot of customers, can help you find and address them. On top of that, source systems evolve continuously, and we have to keep adjusting to those changes to make sure replication stays fast and reliable.

This is the story of how we learned that lesson, and many others. We think of our engineering evolution as having four distinct developmental stages: crawl, walk, run and fly. We’re running now — and getting ready to fly.

First Things First: When Is a Data Pipeline “Reliable”?

We consider a pipeline reliable when:

  • All the expected data arrives at its destination
  • There are no data integrity issues
  • Any unauthorized access to the data is prevented
  • Unexpected behavior and transient issues of the source system are tolerated
  • Sync performance is adequate
  • Regional data requirements are respected and enforced

Having a reliable data pipeline comes down to these fundamentals:

Monitoring & observability. Monitoring and alerting systems must be in place. There should be a method for identifying the root cause of an issue so it can be addressed quickly.

Incident management. There must be a process to address incidents affecting customers, whenever they are reported.

Source API failures handling. Preparations must be in place for source-side failures and unexpected behavior.

Source API change tracking. To avoid breakage caused by contract changes on the source side, there should be proactive monitoring.

Customer support. While not directly related to reliability, support must be able to triage customer requests quickly and decide whether a simple response is enough or the issue should be escalated to the engineering team.

There are other aspects such as security, privacy and data regionality. We will leave out those issues because they deserve a separate article.

Stage 1: Crawl – Making It Work

In the first stage, our goal was to make the product work. Back then, the components that made a reliable data pipeline required a great deal of attention and rigor. Nothing happened quickly.

Monitoring and observability: Our “Wall of Shame”

Our first monitoring system was a scheduled job that sent an email with all the failing connectors to the development team twice a day. We called it the Wall of Shame. The dev team would look at this list of failing connectors and address issues — but only the ones concerning multiple connectors. In those early years, a balance between continuous reliability improvement and feature development was extremely important, because we needed to continue growing our business.
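A minimal sketch of how such a digest could be assembled, assuming each failure is recorded as a connector id plus a short error signature. The `Failure` type, the `build_digest` function, and the `min_connectors` cutoff are hypothetical names for illustration and are not taken from the original job. Sending the email itself is left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    connector_id: str  # hypothetical identifier for one customer connector
    error: str         # hypothetical short error signature taken from the logs


def build_digest(failures: list[Failure], min_connectors: int = 2) -> str:
    """Group failures by error and list the widest-reaching errors first."""
    by_error: dict[str, set[str]] = defaultdict(set)
    for failure in failures:
        by_error[failure.error].add(failure.connector_id)

    # Errors hitting the most connectors come first; ties sort alphabetically.
    ranked = sorted(by_error.items(), key=lambda item: (-len(item[1]), item[0]))

    total = len({failure.connector_id for failure in failures})
    lines = [f"Wall of Shame: {total} failing connector(s)"]
    for error, connectors in ranked:
        # "!" flags errors shared by several connectors, the ones worth fixing first.
        marker = "!" if len(connectors) >= min_connectors else " "
        lines.append(f"{marker} {error}: {', '.join(sorted(connectors))}")
    return "\n".join(lines)
```

Grouping by error rather than by connector mirrors the triage rule described above: an error that shows up on several connectors is flagged, while one-off failures stay in the list without a marker.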

As for observability, reading logs was our primary way to identify root causes. An engineer would introduce a fix and then run the connector locally to see if it succeeded. If it did, the change made its way to production.

Incident management: The whole village

If any critical incident was noticed by employees or customers, a message was posted in the fire-alarm channel in Slack. The dev team jumped on it until we could be sure the fire had been extinguished. If any follow-up work was required, a task was assigned to an engineer to address it in the upcoming days (or weeks).

Source API failures handling: Better done than perfect

We relied on the diligence of the engineering team to manage failures on the source side. Ideally, an engineer considered edge cases and questioned every assumption before we went ahead. But our engineering team had only five engineers (including the CEO and VPE), and that level of investment wasn’t always feasible.

As with any software, connectors need time to mature. They must face real customer use cases and data volumes because this is the only way all the source API kinks can be found and addressed. Usually none of these kinks are documented — often the source company engineers are not aware of them, either, which makes it near-impossible to spot issues with demo sources.

Source API change tracking: Are we lucky?

Even though we had automated tests that helped us to find contract changes on the source side (e.g., Facebook deprecating an API endpoint), finding them was often a matter of luck. If we were able to spot an upcoming contract change, the connector didn’t break. If we didn’t, the connector was broken and we had to rush to fix it.

Customer support: Around the clock

All support team members were in the U.S. (as were most 新巴黎人官方 employees at the time). Because support existed in a single time zone, the team had to work around the clock to make sure our response time met customer expectations. We worked on a first-come, first-served basis and did not record any metrics.

Stage 2: Walk – Automations and Processes to the Rescue

We invested in a second engineering site in Bangalore. The processes and workflows evolved, and we got to the next level of reliability. It was time to focus on improving our reliability even further.

Monitoring & observability: Metrics & bots

Relying on the engineering or the support team to check the Wall of Shame wasn’t good enough — we opted for automation. We created BugBot to automatically file tickets for any connector that was down for more than 48 hours.
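The selection rule behind a bot like this can be sketched in a few lines. This is an illustration under assumptions, not BugBot's actual code: `connectors_needing_ticket`, the `failing_since` mapping, and the `open_tickets` set are hypothetical, and the call that files the ticket is omitted.

```python
from datetime import datetime, timedelta, timezone

DOWN_THRESHOLD = timedelta(hours=48)  # from the article: down for more than 48 hours


def connectors_needing_ticket(
    failing_since: dict[str, datetime],
    open_tickets: set[str],
    now: datetime,
) -> list[str]:
    """Return ids of connectors down longer than the threshold with no open ticket."""
    return sorted(
        connector_id
        for connector_id, since in failing_since.items()
        if now - since > DOWN_THRESHOLD and connector_id not in open_tickets
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    failing = {"crm": now - timedelta(hours=60), "ads": now - timedelta(hours=3)}
    print(connectors_needing_ticket(failing, open_tickets=set(), now=now))
```

Skipping connectors that already have an open ticket keeps a scheduled run from filing duplicates for the same outage.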

In addition to logs, we gathered a few critical metrics, such as the number of failing connectors and the sync-queue size. We put automations in place to post alarms in our fire-alarm channel when metrics hit thresholds that demanded immediate attention (for example, too many failing connectors).
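A threshold check of this kind might look like the sketch below. The metric names follow the two examples mentioned above, but the limits, the `THRESHOLDS` table, and the `breached` function are assumptions made for illustration. Posting the resulting messages to a chat channel is not shown.

```python
# Hypothetical limits; real values would depend on fleet size and capacity.
THRESHOLDS = {"failing_connectors": 50, "sync_queue_size": 1000}


def breached(metrics: dict[str, int], thresholds: dict[str, int]) -> list[str]:
    """Return one alarm message per metric that exceeds its threshold."""
    alarms = []
    for name, limit in sorted(thresholds.items()):
        value = metrics.get(name)
        # A missing metric is skipped here; a real system might alarm on it instead.
        if value is not None and value > limit:
            alarms.append(f"ALARM {name}={value} exceeds {limit}")
    return alarms


if __name__ == "__main__":
    for message in breached({"failing_connectors": 75, "sync_queue_size": 10}, THRESHOLDS):
        print(message)
```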

Incident management: Superheroes

We introduced an on-call rotation of “superheroes.” This was a pair of engineers in two different time zones (PST and IST) who were available 24/7 for a week. These superheroes provided immediate response to any incoming incidents and had to know the product well enough to troubleshoot all incoming incidents. If they required additional input from the team responsible for the broken product area, it sometimes meant there was a lag until the team on the other side of the globe got online.

Source API failures handling: Framework & best practices

We created a variety of best practices, such as a procedure for querying for data incrementally to avoid data integrity issues. We learned how to handle specific types of failures on the source side and how to implement retries. Our connectors framework evolved to require less effort to create new connectors. The quality of new connectors improved. The existing connectors also matured: As an increasing number of customers used them, we could address edge cases and kinks on the source side.
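Two of these practices, incremental querying and retries, can be illustrated together. The sketch below assumes the source exposes an `updated_at` value on each row and that a hypothetical `fetch_since(cursor)` callable returns rows changed after the cursor. `TransientSourceError`, `with_retries`, and `incremental_sync` are illustrative names and do not come from the connector framework described here.

```python
import time


class TransientSourceError(Exception):
    """Hypothetical marker for source failures that are worth retrying."""


def with_retries(call, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run call(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except TransientSourceError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)


def incremental_sync(fetch_since, write, cursor, sleep=time.sleep):
    """Fetch rows changed after `cursor`, write them, and return the new cursor."""
    rows = with_retries(lambda: fetch_since(cursor), sleep=sleep)
    if not rows:
        return cursor
    write(rows)
    # Advance the cursor only after the write succeeded, so a failed run
    # re-reads the same window instead of skipping rows.
    return max(row["updated_at"] for row in rows)
```

Returning the new cursor instead of storing it inside the function leaves the caller free to persist it in the same step that commits the written rows. Rows that share a timestamp with the cursor are one of the edge cases a real connector would need to handle on top of this.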

Source API change tracking: API update check-ins

Instead of hoping that nothing bad was going to happen (and assuming source APIs would always remain stable), we introduced regular check-ins. Engineers responsible for each connector checked for contract changes and filed tickets to address them.

Customer support: Around the globe

We focused on three major regions: North America, EMEA and APAC. We built a support organization in each region, which let our teams work on high-priority issues and then pass them along to the next support team as it came online. While our teams were small at first, we were able to significantly improve our response times and drive down our backlog of tickets from 100+ per person. This follow-the-sun process shortened time to resolution and made for a much-improved customer experience.

Keeping engineering focused on building software rather than triaging tickets was also a major first step in the specialization of roles.

Stage 3: Run – Where We Are Today

Today we have engineers in four locations, and our global support team spans three time zones. We have also built a dedicated SRE (site reliability engineering) team to focus on the reliability of the 新巴黎人官方 platform.

Monitoring & observability: Runbooks & monitoring

The engineering team has invested a great deal of effort preparing detailed internal documentation, alerts, and runbooks for all the services we support. The team is notified of issues once they happen — or even earlier — so they can be addressed before they impact our customers. The alerting system files tickets automatically and notifies the responsible team.

In addition to logs and metrics, we have a full-featured application performance monitoring (APM) solution in place that allows us to not only understand the cause of an issue, but also how to optimize performance and fix bottlenecks.

Incident management: Zero-incident policy

Because of our growth and the complexity of our product features, it is no longer a sustainable solution to maintain a pair of on-call engineers. Today, issues are tackled by the experts in the domain, and they get resolved much more quickly. To reduce the pressure of being on-call 24/7, each team has a pair team located in a different time zone, allowing for a follow-the-sun model.

Additionally, there is a manager on-call rotation to make sure someone is always available to support the team in case of critical incidents. All incidents are filed into our task management system, automatically.

We have a zero-incident policy: All incidents are addressed within a 21-day time window. This policy means teams must prioritize incidents over feature work. This trade-off is worth it: consistency and reliability are vital when it comes to data.

Source API failures handling: Shared components

To improve the quality of the connectors, we have grouped them by functional domain and extracted shared components. This not only improves quality but also makes similar connectors behave more consistently, so they become more robust and predictable. We have also introduced more reusable utilities to make connector development less error-prone.

Regardless of the investments in the framework, reusable components, and shared core, connector building requires effort and diligence to get the data schema right. The goal is to find the best way to fetch information while preserving data integrity and identifying and working around source API quirks.

Source API change tracking: Crawl docs

The process for tracking changes on the source side still involves regular manual check-ins. Nevertheless, we are working on a solution that allows us to track changes automatically by crawling the source documentation pages. The early experiments have been successful.
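One simple way to detect that a documentation page has changed is to compare a content fingerprint against the one stored from the previous crawl. The sketch below shows only that comparison step and is not a description of the solution under development. `fingerprint` and `changed_pages` are hypothetical names, and fetching the pages and extracting their text are assumed to happen elsewhere.

```python
import hashlib
import re


def fingerprint(page_text: str) -> str:
    """Hash page text after collapsing whitespace, so reflowed text compares equal."""
    normalized = re.sub(r"\s+", " ", page_text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def changed_pages(previous: dict[str, str], pages: dict[str, str]) -> list[str]:
    """Return names of pages that are new or whose fingerprint differs from the stored one."""
    return sorted(
        name
        for name, text in pages.items()
        if previous.get(name) != fingerprint(text)
    )


if __name__ == "__main__":
    stored = {"endpoints": fingerprint("GET /v1/users\nGET /v1/orders")}
    print(changed_pages(stored, {"endpoints": "GET /v2/users\nGET /v1/orders"}))
```

A fingerprint only says that something changed, not what. Deciding whether the change affects the API contract would still need a diff and a human or a further automated check.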

Customer support: Foundational to customer success

One of the core principles of 新巴黎人官方 is to be as “reliable as electricity.” Our customers depend on us to deliver all their data from source to destination with reliability and integrity. That’s why we’ve invested heavily in our customer success organization.

We’ve hired a global team of Customer Success Managers (CSMs) and Technical Success Managers (TSMs), and we’ve evolved our support team into a Global Customer Support (GCS) team. Our GCS team is the backbone of our customer success organization, and we continue to evolve the way we support our customers.

We’ve refined our follow-the-sun process so that customers’ requests and incidents are worked on continuously until a resolution is provided. We constantly monitor our resolution times for incremental improvements.

In that vein, we have launched:

  • Feature request portal: While we are adding new connectors and services every month, it is extremely important we provide a way for customers to request features and connectors.

Stage 4: Fly – Where We’re Headed

In addition to the comprehensive monitoring and observability system, we are building a Pipeline Flight Traffic Control System. This is a high-level monitoring dashboard that will provide an overview of the health and performance of all running syncs.

While a reliable incident management process is great, the best way to deal with incidents is to prevent them from happening in the first place. That’s why our new and growing QA team is working hard to put in place as many protections as possible before new code is shipped to production.

Data reliability is a must-have for the automated data pipeline industry, and that’s why 新巴黎人官方 is building a modern data stack strategy with its customers.

Continuing to build our self-service capabilities and community allows us to work with our customers to build their own data strategies. Customers can become truly data-driven businesses. They can break down data silos and bring many data sources together for a holistic view of their organizations.

Data access will push business forward. But getting to where we are has required every bit of dedication and energy we have: we had to crawl, walk, and run. Our story isn’t over: We’re out to fly.
