
Part 2: Identity Resolution: A Dive into Technologies and Approaches

A Dive into Martech Identity Resolution Technologies and Approaches for Customer Data Platforms.

Introduction

In the previous article, we explored the necessity of identity resolution in modern digital marketing. Identity resolution is a subset of the larger entity resolution category. I treat it separately because CDPs and marketers have little need for other entity types, with the exception of product catalogs. This article focuses exclusively on identity resolution and mentions entity resolution only occasionally. We examined why identity resolution is the backbone of personalized customer experiences and robust marketing efforts. Having covered its significance, we now turn to the 'how'. This follow-up article highlights the different technological approaches that make effective identity resolution not just a concept but a working reality. By dissecting these technologies, we aim to equip you with the knowledge to implement a robust identity resolution strategy. A warning: this is a long article, and it offers only a summary of each approach along with a few of my opinions, not a recommendation on which vendor to choose.

The 5 Generic Approaches to Identity Resolution

Medallion Approach

This methodology has its roots in Master Data Management (MDM) solutions, which were partial precursors to today's Customer Data Platforms (CDPs). Notably, most current CDP and data cloud vendors use variations of the Medallion approach. They often combine this foundation with elements from other methodologies to provide hybrid solutions tailored to client needs. Two vendors, Databricks, with its Spark and Delta Lake analytical foundations, and Snowflake, with its data warehouse, let their clients use Medallion for identity as well as for generic data loading, cleansing, and standardization. Many CDP vendors use Snowflake or Databricks as their underlying infrastructure, abstract it further to simplify data management for their clients, and then add identity resolution tools within an approach like Medallion. A marketing department or large enterprise with enough staff resources may instead use Snowflake or Databricks directly to build a customized CDP, creating a more composable ecosystem that suits its needs rather than buying an off-the-shelf product.

Example of a Medallion approach. Salesforce illustrates this graphic on its website to denote how it approaches the data factory in its Data Cloud Platform.

Bronze Stage

Initial Raw Data Ingestion: The journey starts here. Raw data from many channels, such as social media, CRMs, and e-commerce platforms, is ingested into the system. Choosing an ingestion mechanism that can handle both structured and unstructured data is crucial to ensure comprehensive coverage. Many CDP vendors have abstracted the infrastructure to make this step more straightforward and accessible to staff without programming experience. There are exceptions, however, and newer technology like generative AI will start to automate this step with little manual intervention. Typically, there is a reference schema similar to what Adobe or Salesforce provide. This acts as a platform data layer that is then mapped to the originating data source.

Three different schema model examples. Many vendors have tools to customize these schemas, and some do not because of their deep specialization in one vertical industry. Salesforce, Adobe, Algonomy

Data Cleansing: At this point, the data can be rife with inconsistencies—typos, redundant entries, or conflicting information. Specialized algorithms and tools like data validators and sanity checkers are used to purify the data and remove these inconsistencies.

Light Transformation: Once cleaned, the data undergoes initial transformations. This transformation could involve turning timestamps into a standardized format, converting text to lowercase, or eliminating whitespace. The goal is to prepare the data for more rigorous processing in the subsequent stages.
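
The light transformations described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the field names (email, name, created_at), the two accepted timestamp formats, and the assumption that source timestamps are in UTC are all hypothetical choices made for the example.

```python
from datetime import datetime, timezone

# Hypothetical list of timestamp formats seen in the raw sources.
TIMESTAMP_FORMATS = ("%Y-%m-%d %H:%M:%S", "%m/%d/%Y %H:%M")


def normalize_timestamp(value: str) -> str:
    """Parse a known timestamp format and return ISO 8601 (UTC is assumed)."""
    cleaned = value.strip()
    for fmt in TIMESTAMP_FORMATS:
        try:
            parsed = datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # try the next known format
        return parsed.replace(tzinfo=timezone.utc).isoformat()
    raise ValueError(f"Unrecognized timestamp format: {value!r}")


def light_transform(record: dict) -> dict:
    """Lowercase the email, collapse whitespace in the name, standardize the timestamp."""
    return {
        "email": record["email"].strip().lower(),
        "name": " ".join(record["name"].split()),
        "created_at": normalize_timestamp(record["created_at"]),
    }


raw = {
    "email": "  Travis.Smith@Example.COM ",
    "name": "  Travis   Smith ",
    "created_at": "03/15/2023 14:30",
}
print(light_transform(raw))
```

Keeping these steps small and deterministic in the Bronze stage means the later matching stages can assume consistently formatted input.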

Examples of Cleansing rules - these rules are often customizable features within a data management platform, not necessarily the CDP, which is typically the last stage or step and acts as a final data app with a separate UI for specific capabilities like segmentation, not data cleaning.

Silver Stage

Data Standardization: The next move is to standardize the data. Standardization ensures that diverse data points like addresses, dates, and names follow a uniform format. This step is crucial for accurate matching and merging with the probabilistic algorithms used in the following steps. USPS provides a tool to standardize addresses.

Examples of Address standardization using USPS or Canadian PS API solutions

Data Enrichment: The data enrichment phase is crucial for enhancing customer profiles, often employing specialized services like Melissa Personator to fill in missing details such as names, addresses, and phone numbers. One valuable addition to this stage is the integration of the National Change of Address (NCOA) dataset, which is refreshed on a regular schedule to keep customer information current. This dataset improves contactability, and related services offer features like Prison and Deceased Suppression to refine the outreach list. NCOA also enables better householding, that is, grouping contacts by address. By combining these resources, companies can maintain more accurate and actionable customer profiles, leading to more targeted marketing efforts. Many CDPs do not use these services. One in particular that does is Algonomy.

Gold Stage

Final Deduplicated Profile: The final de-duplication stage uses deterministic and probabilistic algorithms to create unique customer profiles. In deterministic matching, exact comparisons between unique identifiers or fields are made to identify duplicates. Probabilistic matching, on the other hand, relies on string-similarity measures such as Levenshtein distance or Jaro–Winkler similarity to produce similarity scores, helping to catch near-matches like "Travis Smith" versus "Tavis Smith." By combining these two methods, the system efficiently removes duplicates, leaving a clean, singular view of each customer.
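
To make the two matching styles concrete, here is a minimal Python sketch. It is an illustration only: the record fields, the is_duplicate helper, and the 0.85 threshold are assumptions made for the example, and a thresholded similarity score is a simplified stand-in for the fuller probabilistic models that vendors use.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale edit distance to a score between 0.0 and 1.0 (1.0 means identical)."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_duplicate(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Deterministic check on email first, then a fuzzy check on name."""
    if rec_a["email"] and rec_a["email"].lower() == rec_b["email"].lower():
        return True  # exact identifier match
    return similarity(rec_a["name"], rec_b["name"]) >= threshold


a = {"email": "travis@example.com", "name": "Travis Smith"}
b = {"email": "", "name": "Tavis Smith"}
print(levenshtein(a["name"], b["name"]), round(similarity(a["name"], b["name"]), 3))
print(is_duplicate(a, b))
```

In practice, a name-only fuzzy match would be far too permissive; real systems combine several fields and weight them, which is why the threshold and field choice deserve careful tuning.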

Fully Resolved Profile: This is the grand finale. The system combines all individual pieces of information to construct a comprehensive, 360-degree customer profile. At this point, advanced machine learning algorithms can also be applied to tag these profiles with predictive attributes, making them ready for targeted and personalized marketing strategies. The profile is then sent to an actual CDP solution (either part of a comprehensive UI/UX or a standalone app with its own UI/UX) so that other real-time solutions can use it. Sometimes all of this looks like one single tool, but underneath, the infrastructure is separated and composable for most vendors. In other words, the ID is pre-resolved and then sent to the CDP, which provides a real-time profile to a real-time campaign delivery solution.

Illustration of a Tableau dashboard for the end result from an overall process view:

Tableau dashboard embedded in Algonomy's CDP solution. Algonomy uses a hybrid Medallion approach with deterministic and probabilistic models for ID resolution.

In summary, the Medallion approach remains a go-to framework for identity resolution, highly valued for its adaptability and breadth of options across each stage. However, it's worth noting that many vendors modify this approach to better align with their platform's specific infrastructure or technical limitations. User interfaces can sometimes abstract these modifications, potentially obscuring the underlying processes. Therefore, businesses should be vigilant in understanding how a vendor has customized or shortened the Medallion approach to suit their platform, ensuring it meets their specific identity resolution needs.


Real-time Index Approach

Although real-time identity resolution is not universally adopted across Customer Data Platform (CDP) vendors, some are making strides in this direction. With newer serverless offerings and better microservices at the edge, Hybrid and Medallion approaches seem to be getting closer to true real-time capability. Tealium, for instance, aims to offer real-time identity resolution. As noted in Article 1, challenges arise when scaling: the accuracy of matches declines as the data set expands. This decline occurs because the larger the profile set, the more comparisons are needed. If every profile is compared with every other, the number of comparisons grows quadratically with the number of profiles.
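
To put numbers on that growth, assume every profile is compared with every other profile exactly once. The count is then n(n-1)/2 for n profiles. The short sketch below (the function name is chosen only for this example) prints the totals for a few data set sizes.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unique pairs among n profiles: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for profiles in (1_000, 100_000, 1_000_000):
    print(f"{profiles:>9,} profiles -> {pairwise_comparisons(profiles):,} comparisons")
```

A thousand profiles need about half a million comparisons, while a million profiles need roughly five hundred billion, which is why exhaustive comparison does not fit inside a real-time request.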

Tealium's Approach - they call it Visitor Stitching.

Limitations on Data Comparisons

With the number of required comparisons acting as a bottleneck, some vendors limit the number of merges to manage the complexity. Tealium, for example, restricts the number of merges per visitor to 50 due to browser limitations. Other CDP vendors often face the same constraint; Adobe has this limitation, too, for online profile merges. Unlike systems that employ probabilistic matching, Tealium mainly uses deterministic, rules-based matching on a single attribute (such as an ID or email address), which brings its own set of trade-offs.
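
A deterministic, single-attribute stitch with a merge cap can be sketched as follows. This is a generic illustration under stated assumptions, not Tealium's or Adobe's implementation: the VisitorIndex class, its method names, and the in-memory dictionaries are all hypothetical, and only the default cap of 50 comes from the text above.

```python
class VisitorIndex:
    """Toy in-memory index keyed by a single attribute (email)."""

    def __init__(self, max_merges: int = 50):
        self.max_merges = max_merges
        self.events = {}    # visitor_id -> list of events
        self.merges = {}    # visitor_id -> number of merges absorbed so far
        self.by_email = {}  # normalized email -> visitor_id

    def track(self, visitor_id: str, event: str) -> None:
        """Record an event against a (possibly anonymous) visitor."""
        self.events.setdefault(visitor_id, []).append(event)
        self.merges.setdefault(visitor_id, 0)

    def identify(self, visitor_id: str, email: str) -> str:
        """Stitch the visitor into the profile that already owns this email."""
        key = email.strip().lower()
        self.events.setdefault(visitor_id, [])
        self.merges.setdefault(visitor_id, 0)
        known_id = self.by_email.get(key)
        if known_id is None:
            self.by_email[key] = visitor_id  # first time this email is seen
            return visitor_id
        if known_id == visitor_id:
            return visitor_id
        if self.merges[known_id] >= self.max_merges:
            return visitor_id  # cap reached: leave the visitor unmerged
        # Exact match on the attribute: fold the visitor into the known profile.
        self.events[known_id].extend(self.events.pop(visitor_id))
        self.merges.pop(visitor_id)
        self.merges[known_id] += 1
        return known_id


index = VisitorIndex(max_merges=1)  # small cap so the limit is visible
index.track("device-a", "page_view")
index.identify("device-a", "Travis@Example.com")
index.track("device-b", "add_to_cart")
print(index.identify("device-b", "travis@example.com"))  # merged into device-a
index.track("device-c", "purchase")
print(index.identify("device-c", "travis@example.com"))  # cap reached, stays device-c
```

The lookup is a single dictionary access, which is what makes this style fast, but any typo in the email means no match at all.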

Source from Zingg.ai to show the volumes an index faces for a single attribute that it needs to compare to find a match.

Ingestion and Indexing

In the real-time index approach, data is ingested and immediately indexed. At first glance, this seems to offer a distinct advantage for applications where timing is critical. Yet this speed comes at the cost of heightened complexity and the need for resource-intensive computational backends, which these systems often do not provide. Immediate indexing can also increase infrastructure costs and introduce potential points of failure if not optimally managed. A better alternative is a hybrid approach, where pre-resolved profiles are made available in real time from a secondary audience sink where history can be maintained. Do not get lost in the real-time ID resolution message from vendors promoting this approach. These vendors do provide other advantages, but this particular feature does not provide the needed value in those solutions. The main remaining advantage of a real-time ID resolution feature lies in use cases beyond a marketing CDP, such as KYC verification and fraud prevention. Tilores.io might be a vendor to consider in this case.

Real-Time Query and Matching

One aspect of the real-time index approach is its immediate query processing, which complements the identity-matching step. In retail, for instance, the system can instantly identify a returning customer and pass that profile to a personalization system for activation. In many connected systems, however, this hand-off is redundant: the personalization solution already houses the profile in a pre-resolved state and keeps a record of events, so the CDP does not need to update it. I think it is better to pre-resolve the profile and make it available to the real-time system as a real-time profile, thus limiting these redundancies and potential false positives.

Rules Engine

The rules engine serves as a cornerstone of the real-time index approach, offering customizable conditions and rules for identity matching. This level of customization allows businesses to adapt the system to their specific requirements, from straightforward email matching to more complex, multi-attribute conditions. However, this flexibility also brings the risk of human error, which can compromise the integrity of the identity resolution process by leading to mismatches. This step is not much different across any approach. All Identity resolution tools within a data platform should have some rule customizability.
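
A rules engine of this kind can be sketched as an ordered list of rules, each made up of conditions that must all hold. The sketch below is illustrative only: the Rule class, the exact helper, the rule names, and the record fields are assumptions made for the example, not any vendor's configuration format.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Condition = Callable[[dict, dict], bool]


def exact(field: str) -> Condition:
    """Condition: both records have a non-empty, identical value for the field."""
    return lambda a, b: bool(a.get(field)) and a.get(field) == b.get(field)


@dataclass
class Rule:
    name: str
    conditions: List[Condition]  # every condition must hold for the rule to match

    def matches(self, a: dict, b: dict) -> bool:
        return all(condition(a, b) for condition in self.conditions)


def first_matching_rule(rules: List[Rule], a: dict, b: dict) -> Optional[str]:
    """Return the name of the first rule that declares the records a match."""
    for rule in rules:
        if rule.matches(a, b):
            return rule.name
    return None


# Hypothetical configuration: a simple email rule, then a multi-attribute rule.
RULES = [
    Rule("email", [exact("email")]),
    Rule("phone_and_postal_code", [exact("phone"), exact("postal_code")]),
]

a = {"email": "travis@example.com", "phone": "555-0100", "postal_code": "27513"}
b = {"email": "tsmith@example.org", "phone": "555-0100", "postal_code": "27513"}
print(first_matching_rule(RULES, a, b))
```

Returning the name of the rule that fired makes it easier to audit why two profiles were merged, which helps catch the human errors mentioned above.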

In summary, the real-time index approach presents both advantages and challenges. While it excels in speed and customization, it also struggles with complexity, resource requirements, and a potential decrease in accuracy as the data scales. As elaborated in Article 1, businesses should carefully consider if this approach aligns with their specific needs and resources. The CDP should be separated from the identity process and become more of a sink for real-time profiles that update real-time campaign delivery systems. Most CDP vendors should operate this way, and separate the CDP as a Data app, creating SKUs of products for different purposes, which is a much cleaner, more accurate, and more stable approach. In other words, the matching process does not need to be real-time, but the profile's querying afterward needs to be real-time. Instead of selling a massive CDP product, create a composable component-based architecture where specific purpose-built apps can be individually sold, spliced, and integrated where needed.

Those seeking a more balanced method could explore hybrid models that incorporate probabilistic matching elements, like those offered by Zingg.ai. While not a real-time ID resolution solution, it could be used as a pre-resolved ID sink that stores profiles and makes them available to a real-time marketing delivery solution.

For CDPs, continuous real-time matching and merging of profiles might be overkill. A more pragmatic approach is to make the resolved profile available in a 'gold' audience dataset. This dataset can then be segmented into smaller, more targeted audiences, accessed by activation triggers, and integrated into various marketing channels like ad networks, personalization engines, and journey orchestration tools. Hopefully, Real-time solutions will move in this direction and be more of a hybrid solution.


Graph DB-Based Approach

Graph-Based Resolution (ID Graph)

In identity resolution, the Graph-Based Resolution approach, commonly referred to as ID Graph, stands out for its innovative use of graph databases. Each entity, such as a customer or device, becomes a node in the graph, and these nodes are interconnected by edges representing various relationships, like email addresses, phone numbers, or social IDs.

Node and Edge Creation

In the case of Amperity, a CDP vendor that runs on a custom AWS architecture, nodes are meticulously crafted within their graph database. Amperity uses Spark and a customized Delta Lake foundation, then processes the data into a graph DB to build these nodes, which form the structural basis for identity resolution. Edges between nodes signify relationships, and these can be as straightforward as an email connection or as complex as a behavioral pattern observed over time. Amperity is a hybrid, but their specialization is a graph DB that uses probabilistic algorithms for the ID resolution process. Their solution would greatly complement platforms whose specialties are activation or data collection. For example, Adobe and Salesforce could use Amperity's specialization to complement their deep marketing activation solutions.

Graph Traversal for Resolution

Graph traversal serves as the cornerstone of identity resolution within a graph database. Amperity employs an advanced graph traversal technique, Stitch, to navigate through nodes and their associated edges efficiently. This specialized approach is further augmented by its probabilistic algorithms like pair-wise and similarity matching to ensure a comprehensive identity resolution process. Given their targeted focus on identity—specifically for large organizations' marketing departments—their approach becomes vital for successfully scaling the graph with increasing data volumes. By employing these advanced techniques, Amperity identifies direct and inferred relationships, offering an invaluable resource for creating a holistic, 360-degree view of the customer.
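
The general idea of resolving identities by traversing a graph can be shown with a small Python sketch. To be clear, this is not Amperity's Stitch and uses no graph database: it is a generic breadth-first traversal over an in-memory graph, with hypothetical record IDs and identifier fields, in which records that share an identifier value are linked and each connected group is treated as one identity.

```python
from collections import defaultdict, deque


def build_graph(records: dict) -> dict:
    """Link records (nodes) that share an identifier value (edges)."""
    by_identifier = defaultdict(list)
    for record_id, identifiers in records.items():
        for kind, value in identifiers.items():
            if value:
                by_identifier[(kind, value)].append(record_id)
    graph = {record_id: set() for record_id in records}
    for linked in by_identifier.values():
        first = linked[0]
        for other in linked[1:]:  # connecting each to the first keeps the group connected
            graph[first].add(other)
            graph[other].add(first)
    return graph


def resolve(graph: dict) -> list:
    """Breadth-first traversal: each connected component is one resolved identity."""
    seen = set()
    clusters = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        cluster = []
        while queue:
            node = queue.popleft()
            cluster.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        clusters.append(sorted(cluster))
    return clusters


records = {
    "r1": {"email": "travis@example.com"},
    "r2": {"email": "travis@example.com", "phone": "555-0100"},
    "r3": {"phone": "555-0100"},
    "r4": {"email": "maria@example.com"},
}
print(resolve(build_graph(records)))
```

Here r1 and r3 share no identifier directly, yet they resolve to the same identity through r2. That inferred link is the strength of the graph view, and also the reason a single bad edge can wrongly merge two people.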

Google sourced image of Amperity's graph db visualization tool

Rules Engine

Amperity incorporates a rules engine similar to other approaches to bring in an extra layer of customization. This feature allows businesses to define how nodes and edges should be evaluated or weighted during the traversal process. Creating custom rules lets you fine-tune the identity resolution process to align with specific business goals or data peculiarities. I do not have insight into deeper features like standardization, enrichment, or reverse append relationships, as noted in the Medallion section, where a graph-based CDP vendor could incorporate NCOA, Melissa, or the USPS.

Google sourced image of Amperity's Stitch dashboard.

In summary, the graph-based approach to identity resolution offers many advantages owing to its architectural strengths. Its unique focus on customization and efficiency positions this method as a standout option in the CDP market. For businesses seeking robust, scalable, and highly personalized identity resolution solutions, the graph-based model provides an unparalleled framework that can adapt and scale according to evolving needs.


Vector DB-Based Approach

Vectorization

In simple terms, vectorization is the process of converting various features or attributes, such as consumer behaviors and preferences, into a mathematical form known as vectors. Think of a vector as a list of numbers where each number represents a specific characteristic of the customer. This conversion into a mathematical form makes it easier to perform complex calculations that help understand and resolve customer identities. Many applications built on the new large language models for generative AI use a type of vector DB to store and search data.

An example of a type of vector DB: SAS Institute calls this the Customer State Vector. This is an older example of their newer approaches. SAS was way ahead with their vector DB in 2011 when I was there.

Traditional methods often focus on finding exact matches, like identical email addresses or phone numbers. The vector DB-based approach goes further by searching for similar vectors. In layperson's terms, the system is intelligent enough to recognize that two slightly different data points might pertain to the same individual. This nuanced understanding allows for a more flexible and often more accurate identity resolution.

Source: Milvus - Vector databases are designed for query and data mining.

Like in other approaches, a rules engine can be added for extra customization. This customization allows businesses to set their criteria for what constitutes a "similar" vector, enabling a high level of specificity tailored to each organization's unique needs.
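
As a rough illustration of vectorization and similarity search, the sketch below turns a name into a vector of character trigram counts and compares vectors with cosine similarity. Everything here is an assumption made for the example: production systems typically use learned embeddings and an approximate nearest-neighbor index in a vector database rather than a pure-Python loop, and the 0.5 threshold simply stands in for the business-defined rule on what counts as "similar."

```python
import math
from collections import Counter


def vectorize(text: str, n: int = 3) -> Counter:
    """Represent text as a sparse vector of character n-gram counts."""
    padded = f"  {text.strip().lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse vectors (1.0 means same direction)."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query: str, candidates: list, threshold: float = 0.5):
    """Return the closest candidate, or None if nothing clears the threshold."""
    query_vector = vectorize(query)
    scored = [(cosine_similarity(query_vector, vectorize(c)), c) for c in candidates]
    if not scored:
        return None
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= threshold else None


print(most_similar("Tavis Smith", ["Travis Smith", "Maria Lopez"]))
```

Raising the threshold trades recall for precision: fewer near-matches are accepted, but fewer distinct people are wrongly treated as the same customer.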

Relevance to CDPs

In the Customer Data Platforms (CDPs) context, vectorization offers opportunities and challenges. On the plus side, this approach can be incredibly effective for businesses aiming to understand customer behavior and preferences at a deeper level. It's advantageous in scenarios where data is complex and multifaceted. The downside, however, is that this method can require a high level of computational power, which could lead to increased operational costs.

Benefits and Trade-offs

The major advantage of using a vector DB-based approach is its ability to provide a more nuanced and flexible system for identity resolution. It can be particularly useful for marketing executives targeting audiences or tailoring campaigns. On the flip side, the computational intensity of this method can be a drawback, particularly for smaller businesses with limited resources.

In summary, SAS Institute uses a vector DB for many of its customer intelligence tools and solutions, as do many entity resolution vendors. Milvus is a modern open-source vector DB option. CDPs that focus only on ID resolution will probably not adopt this approach because of the DevOps and infrastructure costs it would add to their overall platform costs. The vector DB-based approach is well-suited for businesses that require an advanced, nuanced method of identity resolution, particularly when employed by CDP vendors aiming to offer more targeted marketing solutions. However, organizations should be aware of the increased computational requirements and plan their resources accordingly. You must dig deep to understand whether a CDP vendor uses this approach. We may see more of it because of large language models. Generative AI and vector DBs may become sub-components or composable services within a larger platform, performing specific tasks and functions along the data pipeline value chain.


Blockchain-Based Approach

Immutable (Non-Changeable) Records

A blockchain is a digital ledger that records transactions and is tough to change or tamper with. In the context of Customer Data Platforms (CDPs), this means that once a customer's identity is verified and recorded on the blockchain, it becomes a permanent, "immutable" record. For a business and marketing executive, this could mean a much higher level of trust and security in the data used for marketing decisions. It is unclear how a marketer would use this ID today, because no high-volume, real-time campaign delivery solutions use a blockchain ID. However, Apple's entry into VR and AR with its new headset could open the environment to a different ID scheme. One option is the blockchain DID (decentralized identifier), which allows users to create multiple identities and own their identity data, rather than the corporation owning it. The Apple headset may prompt deeper investment in blockchain identity and campaign delivery technologies for marketing within this new and exciting environment. Apple could radically change the marketing landscape and user ID ownership. Let's explore blockchain IDs a bit more in this section.

Smart Contracts for ID Matching

Smart contracts are self-executing contracts where the terms are written directly into code. In simpler terms, they are like automated agreements that perform certain actions when specific conditions are met. For example, you can program a smart contract to automatically match a customer profile in your CDP database with a new set of incoming data based on rules you set up. You could decide that if the email and last name match, it is the same customer, and the existing profile should be updated rather than a new entry created. Applied to identity resolution, smart contracts can thus match customer profiles based on predefined rules set by the business. The smart contract enforces the rules automatically once programmed, so there is no need for a human to go in and double-check.
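
The matching rule in that example can be written out as plain logic. The sketch below is ordinary Python, not an actual smart contract: real contracts are written in a contract language and executed on a blockchain, and the function name and profile fields here are hypothetical. It only shows the "update if email and last name match, otherwise create" rule that such a contract would enforce.

```python
KEY_FIELDS = ("email", "last_name")


def apply_incoming(profiles: list, incoming: dict) -> str:
    """Update the matching profile, or create a new one if no profile matches."""
    for profile in profiles:
        same_email = profile["email"].lower() == incoming["email"].lower()
        same_last_name = profile["last_name"].lower() == incoming["last_name"].lower()
        if same_email and same_last_name:
            # Same customer: merge in the new non-empty, non-key attributes.
            profile.update({
                field: value
                for field, value in incoming.items()
                if value and field not in KEY_FIELDS
            })
            return "updated"
    profiles.append(dict(incoming))
    return "created"


profiles = [{"email": "travis@example.com", "last_name": "Smith", "city": "Raleigh"}]
print(apply_incoming(profiles, {"email": "Travis@Example.com", "last_name": "smith", "city": "Cary"}))
print(apply_incoming(profiles, {"email": "maria@example.com", "last_name": "Lopez", "city": "Austin"}))
```

Because the rule runs without human review, a rule that is too loose will merge the wrong people automatically, so the conditions deserve the same scrutiny as any other matching rule.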

Decentralized Identifiers (DIDs)

Decentralized Identifiers, or DIDs, are unique identifiers not owned or controlled by any single entity or company. Instead, they are secured on a blockchain. Creating these IDs is the current bottleneck for using DIDs, as it is extremely slow and computationally intensive. In practice, however, this approach means that an individual has more control over their identity data. In a traditional setup, your data is stored on centralized servers owned by each corporation you interact with, and they are responsible for securing it. With DIDs, you, as an individual, would hold your identifier, meaning you have much more control over your data. Each customer would have a unique, secure identifier that only they could modify, making identity theft or fraudulent activities significantly harder. Only a few providers are currently building blockchain DID technology, but they could become significant in the next few years. The only blockchain CDP I know of is FUZECDP, and its parent company, Passage Protocol, is the only company I know of using blockchain to build marketing technologies. Their CEO, Zac Choi, is a visionary in approaching this area to address the emerging marketing opportunities. Here are a few others to watch in the Blockchain DID space.

How Do Smart Contracts and DIDs Interact?

Here's where things get interesting. Imagine owning your DID and agreeing to interact with a company's CDP for marketing purposes. A smart contract could be set up so that it only accesses or uses your DID for marketing activities when you give permission. You can specify conditions under which your DID can be accessed, like for a particular marketing campaign or timeframe. Because you control your DID, you must grant permission for this access, usually through your own secure verification process. The smart contract automates this interaction based on the conditions you and the company agree upon.
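
The permission check described above can be modeled in a few lines. Again, this is a plain Python sketch of the logic rather than a real smart contract or DID library: the ConsentGrant structure, the may_access function, the company and campaign names, and the example DID string are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ConsentGrant:
    """Conditions the DID owner has agreed to (hypothetical structure)."""
    did: str
    company: str
    campaign: str
    valid_from: date
    valid_until: date


def may_access(grants: list, did: str, company: str, campaign: str, on: date) -> bool:
    """Allow access only if a grant covers this company, campaign, and date."""
    return any(
        grant.did == did
        and grant.company == company
        and grant.campaign == campaign
        and grant.valid_from <= on <= grant.valid_until
        for grant in grants
    )


grants = [
    ConsentGrant("did:example:123", "acme", "spring-sale", date(2024, 3, 1), date(2024, 3, 31)),
]
print(may_access(grants, "did:example:123", "acme", "spring-sale", date(2024, 3, 15)))
print(may_access(grants, "did:example:123", "acme", "spring-sale", date(2024, 4, 1)))
```

The default answer is "no": access is denied unless the owner has issued a grant that matches every condition, which mirrors the owner-led model described in this section.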

Relevance to CDPs

For businesses using Customer Data Platforms, integrating blockchain could augment existing methods for identity resolution. It adds a layer of security and automation that could be incredibly beneficial for firms dealing with high-stakes data or those who want to improve the reliability and efficiency of their customer data management. The biggest roadblocks are the content delivery systems for campaigns, customer experiences, and personalization. None of them use smart contracts or blockchain DIDs. The next-gen martech stack may have to re-tool or integrate into these newer Owner-led ID systems. I believe this approach will solve the privacy and regulation headaches worldwide today and give data ownership back to the user. Marketers should watch Apple very closely in this regard as their new Apple VR/AR headset may lead the charge in this new realm of marketing and Martech/Adtech.

The User-Corporate Dynamic

This new ID ownership model puts some power back in the hands of the individual and could be a game-changer for data privacy. Companies won't be able to use user profile data however they want; they will need to follow the conditions set in the smart contract, which can be as restrictive or permissive as the ID owner is comfortable with. This new approach offers a new level of control and customization for the consumer while providing businesses with a secure, efficient, and customer-friendly way to manage identity resolution that will reduce liability, lower regulation compliance costs, and make the system less susceptible to fraud and errors.

In summary, implementing blockchain comes with its challenges. The technology is still relatively new and can be complex to set up and maintain. It may require significant investments of time and resources. There is also the matter of scalability: as more records are added, the computational power needed to maintain the blockchain could grow, increasing operational costs. If the approach is widely adopted, I expect this cost to decrease, because the benefits are too great to ignore. Forget about Bitcoin, the crypto bros, FTX, and all the fraud that came with them. That is not what this is about. The token will still exist, but in a different form of usage, not promoted as an investment vehicle. It will be part of a martech system strictly as an ID.

A blockchain-based approach to identity resolution in CDPs offers high security and automation but may require significant investment and technical expertise. The payoff could be substantial for businesses investing in enhanced data integrity and customer trust.


Conclusion

In this article, we've unpacked various approaches to identity resolution, each offering unique advantages and challenges, from traditional methods like Medallion to cutting-edge technologies like Vector databases and Blockchain. As a marketer or business leader, your choice of method will hinge on several factors, including your team's technical capabilities, the cost of the solution, your specific needs for identity resolution, and your willingness to invest in newer technologies.

The world of identity resolution is continuously evolving, making it crucial for decision-makers to stay informed and agile. Whether you're considering adopting advanced computational tools or exploring the enhanced security of Blockchain, your decisions today will lay the groundwork for your future marketing strategies. As the landscape shifts with potential disruptors like Apple entering the fray, being adaptable and informed will be vital to maintaining a competitive edge in personalized marketing.