What is tokenization?
Consider an analogy: people visit the Department of Motor Vehicles (DMV) office for various purposes, such as a driver's license application, vehicle registration, or identity card acquisition. How are these different kinds of requests handled at the DMV? Similar requests are first grouped together by issuing a particular type of token, and requests holding the same kind of token are served in the same queue.
The MDM tokenization process works in a similar manner. Different kinds of tokens are assigned to different kinds of data, and records carrying similar tokens are grouped together and compared with each other during the match process.
In the MDM tokenization process, similar records as well as exactly matching records are grouped together. To trigger tokenization, fuzzy match rules must be configured; exact match rules alone do not generate tokens.
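To make the grouping idea concrete, here is a minimal Python sketch that assigns each record a simplified key and places records sharing a key into the same candidate group. The key-building rule, column names, and records are invented for illustration; the actual encoding used by the MDM Hub is different.

```python
from collections import defaultdict
import re

def simple_match_key(name: str) -> str:
    """Build a crude, illustrative match key: uppercase the name, drop
    punctuation, and keep the first three letters of each word so that
    minor spelling variations still collide on the same key."""
    words = re.sub(r"[^A-Za-z ]", "", name).upper().split()
    return "".join(word[:3] for word in words)

# Hypothetical base object records (ROWID, name).
records = [
    ("1001", "Jonathan Smith"),
    ("1002", "Jonathon Smith"),   # spelling variation
    ("1003", "Maria Garcia"),
    ("1004", "Maria Garcia-Lopez"),
]

# Group records by key -- analogous to queueing DMV requests by token type.
groups: dict[str, list[str]] = defaultdict(list)
for rowid, name in records:
    groups[simple_match_key(name)].append(rowid)

for key, rowids in groups.items():
    print(key, rowids)
```

Records 1001 and 1002 end up in the same group despite the spelling difference, which is the behavior the match process relies on.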
What are match tokens?
These are encoded and non-encoded representations of the data in base object records. The tokenization process generates one or more tokens for each record.
Match tokens include the following (a simplified sketch of both forms appears after this list):
- Match keys are fixed-length, compressed strings consisting of encoded values built from all of the columns in the Fuzzy Match Key of a fuzzy-match base object. Match keys contain a combination of the words and numbers in a name or address such that relevant variations have the same match key value.
- Non-encoded strings consist of flattened data from the match columns (Fuzzy Match Key as well as all fuzzy-match columns and exact-match columns).
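Here is a minimal sketch of the two forms for a single record, assuming hypothetical match columns and a made-up encoding; the Hub's real key-building algorithm is proprietary and not reproduced here.

```python
import hashlib

def encoded_match_key(fuzzy_key_value: str, width: int = 8) -> str:
    """Illustrative stand-in for a fixed-length, encoded match key:
    normalize the fuzzy match key value, hash it, and keep a fixed
    number of characters. The real encoding is very different."""
    normalized = " ".join(fuzzy_key_value.upper().split())
    return hashlib.sha1(normalized.encode()).hexdigest()[:width].upper()

def non_encoded_string(match_columns: dict) -> str:
    """Illustrative stand-in for the flattened, plain-text representation
    of all match columns (fuzzy match key plus fuzzy- and exact-match columns)."""
    return "|".join(str(value).upper().strip() for value in match_columns.values())

record = {                      # hypothetical match columns for one record
    "NAME": "Jonathan Smith",   # Fuzzy Match Key column
    "ADDRESS": "12 Main St",    # fuzzy-match column
    "COUNTRY_CODE": "US",       # exact-match column
}

print("encoded key :", encoded_match_key(record["NAME"]))
print("non-encoded :", non_encoded_string(record))
```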
How are tokens generated?
Token generation is a complex process that depends on several factors: the match column configuration, the fuzzy match criteria, and the configured key width.
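For example, the key width setting controls how many key variations are generated per record, and therefore how many candidate pairs the match process later has to evaluate. The sketch below illustrates that trade-off only conceptually; the width names are borrowed from typical Hub settings, but the variant-generation rules here are invented and are not the Hub's actual algorithm.

```python
from itertools import permutations

def key_variants(name: str, key_width: str = "Standard") -> set[str]:
    """Very rough illustration of how a wider key setting can produce more
    key variations per record (and therefore more match candidates).
    The variant counts and encoding here are made up, not the Hub's rules."""
    words = name.upper().split()
    variants = {" ".join(p) for p in permutations(words)}   # word-order variants
    if key_width == "Limited":
        return {sorted(variants)[0]}          # keep a single key
    if key_width == "Extended":
        # also add keys with one word dropped, to catch partial names
        variants |= {" ".join(w for j, w in enumerate(words) if j != i)
                     for i in range(len(words))}
    return variants

for width in ("Limited", "Standard", "Extended"):
    print(width, len(key_variants("JONATHAN ROBERT SMITH", width)))
```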
In which table are tokens stored?
Match tokens are stored in the match key table that the Hub maintains for each fuzzy-match base object (C_<base object name>_STRP). Its key columns are listed below; a query sketch follows the list.
- SSA_KEY: Match key for the record. It is the encoded representation of the value in the fuzzy match key column, such as the name, address, or organization name, for the associated base object record. The string consists of fixed-length, compressed, and encoded values built from a combination of the words and numbers in a name or address.
- SSA_DATA: Non-encoded, plain-text string representation of the concatenated match columns defined in the associated base object table. The concatenated match columns include the fuzzy match key as well as all fuzzy-match columns and exact-match columns.
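As a rough illustration of how you might inspect the stored tokens, here is a sketch that uses Python's sqlite3 module as a stand-in for whatever database driver your Operational Reference Store actually uses. The connection target, the base object name (PARTY), and the resulting table name C_PARTY_STRP are assumptions for the example; verify the actual names in your own schema.

```python
import sqlite3  # stand-in driver; a real ORS typically runs on Oracle, SQL Server, or DB2

# Hypothetical example: inspect the match tokens generated for one base
# object record. C_PARTY_STRP is an assumed name following the
# C_<base object>_STRP pattern; ROWID_OBJECT identifies the record.
conn = sqlite3.connect("ors_copy.db")   # assumed local copy, for illustration only
cursor = conn.execute(
    """
    SELECT ROWID_OBJECT, SSA_KEY, SSA_DATA
    FROM C_PARTY_STRP
    WHERE ROWID_OBJECT = ?
    """,
    ("1001",),
)
for rowid_object, ssa_key, ssa_data in cursor.fetchall():
    print(rowid_object, ssa_key, ssa_data)
conn.close()
```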
Why does the tokenization process cause performance issues?
Performance issues, whether in the tokenization process or in other processes, depend on the factors below:
- Volume of records getting processed
- Hardware configuration
- MDM Hub configuration
- Application server configuration
- Data and hotspot issues
Of the causes mentioned above, MDM configuration-related problems can be fixed by tuning the MDM Hub and the related settings in the properties files. Application server configuration issues can be addressed in a similar way.
However, addressing volume- or data-related issues requires input from business users and data stewards, and any architectural or hardware-level issues must be addressed during the design phase.