Finding duplicate records

Tagged: 

Viewing 17 reply threads
  • Author
    Posts
  • December 31, 2016 at 7:53 AM #20701

    Peter Riley
    Participant

    Hi,

    Is there a way to list duplicate records? I have a ‘Movies Watched’ database, with Actors, Directors and Writers as individual forms, linked to the main Movies Watched form on a many-to-many basis. I’m finding that when I link an actor’s name on the main form I sometimes end up with that person appearing twice in the Actors records.

    Thanks

    December 31, 2016 at 4:37 PM #20703

    Brendan
    Keymaster

    No there’s no Find Duplicate Records function. You can sort by actor name and then just scan them manually to find the duplicates.

    October 18, 2017 at 2:12 PM #25451

    Efren Santos
    Participant

    After changing the sync option from bropbox to iClound may database duplicate some records, wish that there is a search function to find the duplicate to easily delete.

    October 18, 2017 at 3:53 PM #25452

    Brendan
    Keymaster

    Your best bet is to sort your records so you can more easily spot the duplicates and then delete the ones you don’t want.

    January 6, 2021 at 3:54 AM #43076

    JB Be
    Participant

    Hi,
    Is there in the meantime a function “Find duplicate entries” implemented? It seems to me a very fundamental function for a database. I am often consolidating lists from various sources into Tap Forms and am totally lost without efficient means to consolidate entries with e.g. the same identifier (mostly email address). Excel has it and even early FileMaker had it, too.
    Thanks for a – hopefully positive – reply.
    Be.

    January 23, 2022 at 9:51 AM #46391

    Bert Rietveld
    Participant

    I have over 84,000 records. Sorting and manually scanning would not work. Is there a script to find duplicates?

    January 23, 2022 at 5:42 PM #46393

    Sam Moffatt
    Participant

    Are you looking for exact duplicates (all fields identical) or duplicates on a primary or subset of key fields? If you want all fields identical, does this extend to link to form fields, table fields or photo fields and their contents?

    I was working a while back on tooling to do merging of records through the scripting interface to handle when I end up with duplicate shipping records due to key mismatch (sometimes the tracking numbers are changed in transit) but I never got it finished because the UI experience wasn’t something I figured out.

    If you’re after a subset of key fields, I did a quick POC where you create a new field in your source form with the key. I already had a composite key field built using a calculation field I use which looks like this (square brackets for field placeholders):

    IFEMPTY([Marketplace];"No marketplace";
    CONCAT([Marketplace], "/",[Store Name], "/",[Order ID]
    ))
    

    This creates what should be a unique key for the record based on my own metadata (designed to handle direct sales and hosted marketplaces). I then created a new form called “Orders Dedupe” and put in it three fields: a *string* type field called “key”, *link to form* field called “Order Dedupe” and a script field which counts the entries in the order. The link to form field is configured as a JOIN type on the “key” field of the dedupe form and the calculation field in the original form. The script field looks like this (change your ID’s to match):

    function Key_Match_Count() {
    	var order_dedupe = record.getFieldValue('fld-fde68e7d2b384cb2a4452d3ae66bbab1');
    	return order_dedupe.length;
    }
    
    Key_Match_Count();
    

    In this form also create a new saved search that uses the script field and is set to look for values greater than one as those will be the duplicates.

    Last step is to populate this form, go back to your base form and create a new form script. I wrote the script below to scan each record, use an md5sum implementation to create a hash of the key field and then look to see if that record exists in the dedupe form:

    document.getFormNamed("Script Manager").runScriptNamed("md5sum");
    
    var purchase_key_id = 'fld-3e49aaa5bc32429c8f0f0f234878356d';
    var dedupfield__key_id = 'fld-c52906ee940142a0a54fac1a98346afd';
    var dedupForm = document.getFormNamed("Order Dedupe");
    
    function Extract_Purchase_Keys() {
    	for (let sourceRecord of form.getRecords()) {
    		let purchaseKey = sourceRecord.getFieldValue(purchase_key_id);
    		if (!purchaseKey) {
    			console.log("Missing purchase key for record: " + sourceRecord.getUrl());
    			continue;
    		}
    		let purchaseKeyHash = "rec-" + md5(purchaseKey);
    		let dedupRecord = dedupForm.getRecordWithId(purchaseKeyHash);
    		if (!dedupRecord) {
    			dedupRecord = dedupForm.addNewRecordWithId(purchaseKeyHash);
    			dedupRecord.setFieldValue(dedupfield__key_id, purchaseKey);
    		}
    	}
    	document.saveAllChanges();
    }
    
    Extract_Purchase_Keys();
    

    This actually found me a dupe record that I hadn’t found in my orders form when I went back to look at the saved search. It’s a bit of a journey, might turn it into a video at some point when I get some more time.

    January 24, 2022 at 2:35 AM #46412

    JB Be
    Participant

    Fantastic. I filed my ‘duplicate’ question more than a year ago (see more upwards) and I am still in need of a solution, too. Thank you for taking this up, Sam. It all looks a little cryptic to me, though, being an informed user but not fluent in scripting…

    In my case, I am using the email-Adress as a unique identifier for 2000+ records. So I am looking for a (scripting) way to find records that have exactly the same email address in the specific field named ’email1′.

    In rare cases, there might be (erroneously) several email addresses in this field. In this case, I would look for records that have the same email address at least once in the specific field named ’email1′.

    Would you think the script above might help?

    January 25, 2022 at 1:05 AM #46420

    Sam Moffatt
    Participant

    The “OR” use case is a little different because you need to split out the individual values first. Thinking a little about this what might work is a form of email addresses and then a link to form M:M relationship. Each email address is split up and added to the other form and linked back. Since it’s M:M you can link multiple parent records and then count that way to unique email addresses.

    I created a simple form called “Source Form” and added a name and email address field to it. I then added a second form called “Emails” and added a field called “Email” to it. In the “Source Form” I went back and added a Link to Form to the “Emails” form, set it to M:M and ticked “Show inverse relationship” (I do that last because there used to be a bug where if you changed the type from 1:M to M:M, the inverse field wouldn’t update properly, I think it’s fixed). Back in our “Emails” form, I added a script field that just returns the length of the Link from Field that was created (it’s as simple as this: record.getFieldValue('fld-2b0ad00a96bd4cf4b20dca95899a7a5a').length; where the field ID is the ID of the link from form field). Last piece is to grab the md5sum.js file and add it to the “Source Form” as a new form script and then create another form script that I called “Extract Emails” with this in the contents:

    form.runScriptNamed("md5sum");
    
    var source__email_id = 'fld-4e8071f8f3ce4a75a66954f6a3c636ef';
    var emails__email_id = 'fld-7e880e79867345549fb04f377412fefd';
    var emails_link_id = 'fld-f5d00c535c3c437a87a262f6d0f434e4';
    
    var emailsForm = document.getFormNamed("Emails");
    
    function Extract_Emails() {
    
    	for (let currentRecord of form.getRecords()) {
    		let emails;
    		try {
    			emails = currentRecord.getFieldValue(source__email_id).split(' ');
    		} catch(e) {
    			console.log("Error processing record: " + e);
    			console.log(currentRecord.getUrl());
    			continue;
    		}
    		
    		for (let email of emails) {
    			console.log(email);
    			let recordHash = "rec-" + md5(email);
    			let candidateRecord = emailsForm.getRecordWithId(recordHash);
    			if (!candidateRecord) {
    				candidateRecord = emailsForm.addNewRecordWithId(recordHash);
    				candidateRecord.setFieldValue(emails__email_id, email);
    			}
    			currentRecord.addRecordToField(candidateRecord, emails_link_id);
    			document.saveAllChanges();
    		}
    	}
    }
    
    Extract_Emails();
    

    The first line imports the md5sum script we created. The next three var statement lines are the field ID’s for the “Email” field in the “Source” form, the field ID of the “Email” field in the “Emails” field and then the field ID of the link to form field in the “Source” form that links back to the “Emails” form. The emailsForm is getting the Tap Forms object representing the “Emails” form.

    The function then iterates over all of the records in the current form (e.g. “Source” form), it then splits up the email field based on a space. You can change this to be a comma or any other character you want. If you need more than one separator you will need to use a regular expression but we’ll skip that. The try/catch is just in case there is something weird to trap the error there and keep processing since we’re dealing with a field value.

    The second for loop is then iterating over the emails in the record and creating a new record ID using the MD5 of the email. This pattern is somewhat risky because we intentionally will create colliding document ID’s that generally Tap Forms will avoid. We do that here to be able to do the key lookup without having to build the index ourselves (TF will do that for us). We check to see if a record already exists using getRecordWithId and if it doesn’t we create it using that same ID using addNewRecordWithId and set he email address. We then have Tap Forms link the “Source” record we’re working with to the new email record and save all the changes.

    The saveAllChanges is inside the loop because I’ve had issues with TF where the links behave weirdly if you manipulate too many of them via scripting without saving. There’s probably a bug somewhere but reproducing it is a challenge so this is my work around.

    This should fill up the “Emails” form with the unique email addresses all linked to their relevant parent forms. I had to do a manual refresh of the form because the script field didn’t populate the links properly. Once I’d verified my duplicates were being detected, I created a saved search on the count of linked records being greater than one.

    Attaching a form template with that in it as an example to play with. The earlier one I did a quick change based on an existing form so not so easy to share because it’s got a bunch of other fields in it but this is a relatively concise use case and should work to handle multiple email addresses as well.

    Attachments:
    You must be logged in to view attached files.
    January 25, 2022 at 9:10 AM #46424

    JB Be
    Participant

    Thank you so much! What a kind help. What incredible responsiveness!

    The proposed solution is obviously already able to address the more complicated case of multiple email addresses in the same field. I must admit, however, that even despite the detailed explanation, I am still challenged. Because the ‘Duplicate’ function is missing, Tap Forms Organizer remained rather worthless for me and I have put it aside for months (but thanks to your help, that will probably change now!). Now I have to work my way back into the overall functionality of Tap Forms Organizer from scratch.

    What would it look like if there was only one sole email address in the field to assess? Would the script then be much simpler and even more transparent for the novice willing to learn?

    January 25, 2022 at 12:42 PM #46426

    Sam Moffatt
    Participant

    So the outer for loop to go through all of the records in the current form is required regardless of approach (first or second) though the inner for loop wouldn’t be required but the contents of the loop would still be required even if dealing with a single value. The variable names are slightly different between the two examples but you can see a similar pattern of form.getRecords then getFieldValue then creating the rec- prefixed ID, getRecordWithId to see if it exists, and if it doesn’t addNewRecordWithId to create the record and setFieldValue to set the value. The second one doesn’t use a JOIN so it needs to explicitly call addRecordToField to make the link to form field work properly.

    I’ve added it to my todo list for the videos, let me see if I can sneak some time in tonight to work through the process in video form because it becomes a little easier I feel when you can see the pieces coming together and follow along (also why my videos are uncut to ensure that everything is visible step by step).

    January 25, 2022 at 12:55 PM #46427

    JB Be
    Participant

    Fabulous help. I start seeing the patterns. I am sure, a step-by-step video would further help clear the fog…

    January 28, 2022 at 10:21 AM #46484

    Sam Moffatt
    Participant

    Ok, the step by step video recreating that form is live though I did run into a weird bug where the links were being duplicated on subsequent executions for some reason so that might be in the Keymasters domain to fix. Hopefully this helps bring a little more light to the situation :)

    January 28, 2022 at 8:14 PM #46504

    Brendan
    Keymaster

    So I’ve been thinking about this today after watching your video. Thinking about how I could offer duplicate finding in Tap Forms at the record level. It’s a complex issue to detect issue because of all the relationships to other objects and other forms.

    I saw that you were using a hash function so I was thinking that could be a way to know if one record is a duplicate of another. I could fetch all the records grouped by the hash value. All records with the same hash value should be identical. But more flexible would be an interface that lists all the fields that could be used to find only those records that are duplicates for the selected fields.

    But I wouldn’t want to have to deal with then digging into relationships.

    Then there’s the issue of displaying the results. It wouldn’t be too useful if it was just a big list of records. There would have to be different sections with the original record marked and all the duplicate records marked differently so they would be easy to see and then delete if so desired.

    Anyway, just thinking about this problem.

    January 28, 2022 at 10:06 PM #46508

    Sam Moffatt
    Participant

    So the simple case is a row hash of all of the field values (or some representation of it) but you run into issues where you need to internally dedupe photos and documents first so that you have a value there, perhaps through some sort of content addressable system. GIT uses this to hash it’s blobs and then identify them by their hash, so something like that would work. That solves photos/file attachment fields. I think if you then move into relationships, the JOIN type can be excluded because an identical record will have identical JOIN relationships, so you’re only looking at the regular join types. In that case I think you normalise down that to a set of linked record ID’s and don’t go any further. If we consider that we’re attempting to look for duplicate records, then a duplicate record is one that maintains the same number of outbound links.

    A relatively small implementation of a row hash would be to take the existing field values object, substitute the file metadata for their hashes and flatten down the link to form documents into a set of document IDs. Then you can hash the whole JSON representation of that object (or what ever serialisation makes sense) and that because your unique key. If you precompute that each time a record is modified (which should be easy since it’s already in memory), then you can build a really easy deduplication function.

    For those wishing to pay for the full table scan, you could easily adapt the above technique to only include specific fields they care about to handle composite keys for those people who have composite keys. Even further out could be configuring a set of “key” fields to generate the hash upon as well.

    As an aside, Teradata uses a conceptually similar row hash technique to handle both distribution of data to storage nodes and also automatic deduplication of inserted rows. If there is a primary key set it uses that to determine the hash and which node it should be inserted into.

    UI wise you could probably reduce it down to a special type of saved search with a synthetic link to form field interaction. You’ve built a pretty powerful system to showing rows and records already, I think with some thought you could just use the single column list view.

    November 27, 2022 at 5:19 AM #48322

    Patrick Donnelly
    Participant

    Man,
    I would be stoked to have it simply list dupes in one field.
    Better still (for me) if it deleted the second instance of matching records.
    At 110,000 records I really don’t want to wade through that manually.
    thanks,
    pd

    November 27, 2022 at 6:05 AM #48323

    JB Be
    Participant

    Thanks, Patrick, for taking this up. I admit that scripting is to hard for me (although I gave it a try). Getting a menu command as you described would save my investment in TF… :-|

    November 28, 2022 at 2:35 AM #48326

    Brendan
    Keymaster

    Are all your duplicate records identical? Or are there certain fields that are identical but maybe one of the fields might contain unique values between records?

    If there was a Find Duplicates function, there would have to be a decision made about what constitutes a duplicate record. Is it a duplicate if 80% of the field values are identical? Or must it be 100% of the field values are duplicates?

Viewing 17 reply threads

You must be logged in to reply to this topic.