
Community Show: Reltio Data Sharing with Databricks: Lessons and proven best practices

By Sara Brams-Miller


Our most recent community show revisited Reltio Data Sharing with Databricks, with a practical focus on lessons learned, setup guidance, and best practices from real-world implementations. The session, led by Ankur Gupta, covered how teams can securely make trusted Reltio data available in Databricks for analytics and AI/ML workloads—without copying or moving data.

Reltio Data Sharing with Databricks

Ankur started with a quick overview of how Reltio Data Sharing with Databricks works. Reltio provides trusted, unified data across key objects such as entities, relationships, interactions, matches, merges, activities, and workflows. Through Delta Sharing, that data can be made available in a Databricks account using a simple setup process.

The demo highlighted three important benefits:

  • Simple setup: Create a data share in Reltio, provide the Databricks sharing identifier, and activate the share

  • Zero-copy access: Make Reltio data available in Databricks without moving it out of Reltio

  • Flexible formatting: Choose OV-only data for schema simplification, or share all values in a column-struct hierarchical format
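On the Databricks side, activating a Delta Share typically comes down to creating a catalog from the provider's share. A minimal sketch of the SQL involved, assuming hypothetical provider, share, and catalog names (the actual identifiers come from your Reltio data share configuration):

```python
def mount_share_sql(catalog: str, provider: str, share: str) -> str:
    """Build the Databricks SQL that exposes a Delta Share as a catalog.

    `catalog`, `provider`, and `share` are placeholders here; the real
    values come from the data share you create and activate in Reltio.
    """
    return f"CREATE CATALOG IF NOT EXISTS {catalog} USING SHARE {provider}.{share}"

# In a Databricks notebook you would execute this via spark.sql(...):
stmt = mount_share_sql("reltio_shared", "reltio_provider", "customer_share")
print(stmt)
```

Once the catalog exists, the shared Reltio tables can be queried in place, with no data copied out of Reltio.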

OV-only data and schema simplification

A key topic in the session was how to choose between OV-only data and all values. When OV-only is selected, Reltio shares only operational values and simplifies the schema into a tabular format, with each attribute in Reltio's data model exposed as a separate column in the shared tables. This makes the data easier to consume in Databricks for reporting, dashboards, and analytics.

When all values are shared, both OV and non-OV values are included, and the data remains in a hierarchical column-struct format.
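The difference between the two shapes can be shown with a small sketch. Assuming a hypothetical entity payload in the hierarchical all-values form (attribute names and the `ov` flag structure here are illustrative, not Reltio's actual schema), OV-only schema simplification keeps just the operational value of each attribute and flattens the record to one column per attribute:

```python
# Hypothetical all-values record: each attribute holds a list of value
# structs, one of which is flagged as the operational value (OV).
all_values_record = {
    "FirstName": [
        {"value": "Jon",  "ov": False},
        {"value": "John", "ov": True},
    ],
    "Country": [
        {"value": "USA", "ov": True},
    ],
}

def simplify_to_ov_row(record: dict) -> dict:
    """Flatten a hierarchical record to one column per attribute,
    keeping only the operational value - a sketch of what OV-only
    schema simplification produces."""
    return {
        attr: next(v["value"] for v in values if v["ov"])
        for attr, values in record.items()
    }

print(simplify_to_ov_row(all_values_record))
# {'FirstName': 'John', 'Country': 'USA'}
```

The flat form on the last line is what lands in each table row under OV-only sharing; the nested form above is what an all-values share preserves.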

Ankur also clarified that customers can create two active data shares at the same time:

  • One share for OV-only values with schema simplification

  • One share for all values in hierarchical column-struct format

This gives teams flexibility when different downstream use cases require different data structures.

Best practices for setup

The session covered several setup recommendations designed to improve performance and reduce operational friction. Before creating a new data share, teams should complete their initial data load into Reltio and allow match, merge, and unification processes to finish. This helps create a steady state before sharing data into Databricks.

Ankur also recommended disabling activity log sync before the initial setup. Activity data can be much larger than other data objects, so syncing it too early can delay higher-priority data such as entities or relationships.

The recommended sequence is:

  • Disable activity log sync

  • Complete the initial data load and verification in Reltio

  • Create the outbound data share

  • Sync historical data by object type in sequence

  • Re-enable activity log sync

  • Sync historical activity data

This approach helps avoid overloading the system and supports more predictable performance.

Consuming shared data in Databricks

Ankur also shared guidance on how teams should consume the data once it is available in Databricks. Based on lessons from Reltio's collaboration with Databricks, serverless compute is recommended when reading from materialized views.

The session also clarified which datasets to use for different access patterns. For interactive or near-real-time analysis, JSON tables refresh faster, but they always deliver data in the hierarchical column-struct format, never the simplified schema, regardless of the data share setup. For reporting, dashboards, and ML pipelines, materialized views provide schema simplification based on the data share setup, but with higher refresh latency.

Ankur also introduced an upgraded approach using streaming tables. These bring together the benefits of JSON tables and materialized views:

  • Near-real-time refresh

  • Schema simplification based on the data share setup

  • No change to existing table names or query structures

Customers interested in the upgraded data sharing experience were encouraged to contact Reltio for enablement.

Practices to avoid

The session also highlighted what not to do. Teams should avoid enabling data sharing before initial data loads and unification are complete. They should also avoid triggering historical sync for all objects at once, since that can create load and impact performance.

For consumption, Ankur reinforced three important cautions:

  • Do not use non-dedicated classic compute for materialized views

  • Do not build downstream solutions on landing tables

  • Do not use data sharing for non-analytical polling use cases

Reltio Data Sharing with Databricks is best suited for analytics, reporting, dashboards, and AI/ML pipelines—not for polling every few minutes to push changes into operational downstream applications.

Benchmark expectations

Ankur shared benchmark guidance to help customers plan. For a full data sync of 100 million entity records with 100 attributes, teams can expect roughly 10 hours and 15 minutes for the data to become available in Databricks.

For steady-state updates, 100,000 record changes—including inserts, updates, and deletes—can be available in approximately 9 to 10 minutes.
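Those figures translate into rough throughput numbers that can be useful for capacity planning, under the assumption that sync time scales roughly linearly with record count:

```python
# Full sync benchmark from the session: 100 million entity records
# (100 attributes each) available in ~10 hours 15 minutes.
full_sync_records = 100_000_000
full_sync_seconds = 10 * 3600 + 15 * 60          # 36,900 s
full_sync_rate = full_sync_records / full_sync_seconds
print(f"Full sync: ~{full_sync_rate:,.0f} records/s")   # ~2,710 records/s

# Steady state: 100,000 record changes in ~9-10 minutes (midpoint used).
delta_records = 100_000
delta_seconds = 9.5 * 60
delta_rate = delta_records / delta_seconds
print(f"Steady state: ~{delta_rate:,.0f} records/s")    # ~175 records/s
```

These are planning estimates derived from the quoted benchmarks, not guarantees; actual throughput depends on tenant size, attribute counts, and load.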

These benchmarks help teams set expectations for both initial setup and ongoing data freshness.

Beyond outbound data sharing

The session also briefly covered Reltio’s zero-copy integration with Databricks for interaction and transaction data. While outbound data sharing makes Reltio data available in Databricks, zero-copy integration enables Reltio to connect with interaction or transaction data in Databricks without copying it. This capability is part of Reltio Intelligent 360™ and can help teams link external interaction data to entity profiles in Reltio.

Conclusion

This community show provided a practical look at how teams can get more value from Reltio Data Sharing with Databricks. The biggest takeaway: successful adoption depends not only on enabling the share, but on following the right setup sequence, choosing the right schema option, and consuming the right tables for the right workloads.

With the right approach, teams can make trusted Reltio data available in Databricks for analytics and AI/ML use cases in a governed, scalable, and efficient way.

#CommunityWebinar

#Featured

#DataQuality
