Sandvox – PHP & MySQL

SQL – Removing duplicate records

October 31, 2012  |  Posted by admin |  Comments Closed


Sometimes when a database does not have a primary or unique key defined it is easy to end up with duplicate records in a table. Ensuring a correct data model is important but sometimes things can be overlooked. Now that it has happened we now need to find a way of removing these duplicate entries.

Firstly to identify a duplicate entry we need to determine what fields should be considered in determining a duplicate. For instance if we had a table invoices, we could say that the field invoice_number should be unique and we would check based on just this field. In a lab we could say that the fields batch_date and batch_number (where batch number is reset on a daily basis) represent the uniqueness of the data.

To determine the records with duplicate entries we would issue the following command:

SELECT invoice_number FROM invoices GROUP BY invoice_number HAVING COUNT(*) > 1;

or

SELECT batch_date, batch_number FROM batches GROUP BY batch_date, batch_number HAVING COUNT(*) > 1;


What this is saying is lets focus on entries that have a count per unique fields greater than 1. It is important not to check for the count per unique fields equal to 2 as there could be entries with more than 2.

Now that we can easily identify the duplicate records we now need to determine which ones should be removed. If we were to use the unique fields in a DELETE statement we would remove all entries.

-- Incorrect - this will remove all sets of duplicate records.
DELETE invoices FROM invoices i INNER JOIN (SELECT invoice_number FROM invoices GROUP BY invoice_number HAVING COUNT(*) > 1) x WHERE i.invoice_number = x.invoice_number;

The removal process is handled in different ways depending on the database server. We will discuss each one:

MySQL

MySQL has built in features (ALTER IGNORE TABLE tablename ADD UNIQUE INDEX (fields)) which allows duplicates to be removed through the addition of a unique index even with duplicate entries.

By using the IGNORE keyword the index gets created with the entries removed.

ALTER IGNORE TABLE invoices ADD UNIQUE INDEX (invoice_number);
ALTER IGNORE TABLE batches ADD UNIQUE INDEX (batch_date, batch_number);

Consider the following example:

mysql> create table invoices (invoice_number int, amount int);
Query OK, 0 rows affected (0.08 sec)

mysql> insert into invoices(invoice_number, amount) values (1,10),(2,11),(2,11),(3,99);
Query OK, 4 rows affected (0.00 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> select * from invoices;
+----------------+--------+
| invoice_number | amount |
+----------------+--------+
|              1 |     10 | 
|              2 |     11 | 
|              2 |     11 | 
|              3 |     99 | 
+----------------+--------+
4 rows in set (0.00 sec)

mysql> alter table invoices add unique index (invoice_number);
ERROR 1062 (23000): Duplicate entry '2' for key 1
mysql> alter ignore table invoices add unique index (invoice_number);
Query OK, 4 rows affected (0.03 sec)
Records: 4  Duplicates: 1  Warnings: 0

mysql> select * from invoices;
+----------------+--------+
| invoice_number | amount |
+----------------+--------+
|              1 |     10 | 
|              2 |     11 | 
|              3 |     99 | 
+----------------+--------+
3 rows in set (0.00 sec)

PostgreSQL

In PostgreSQL we can use the CTID which is a pseudo column in PostgreSQL which identifies the file block number and the position in the block. Duplicate records will still have a unique CTID value.

Using this knowledge we can remove the larger CTID entry.

DELETE FROM invoices WHERE CTID IN (SELECT MAX(CTID) FROM invoices GROUP BY invoice_number HAVING COUNT(*) > 1);

Now lets have a look at this in action:

rcashell=# create table invoices (invoice_number integer, amount integer);
CREATE TABLE
rcashell=# insert into invoices(invoice_number, amount) values (1,10),(2,11),(2,11),(3,99);
INSERT 0 4
rcashell=# select * from invoices;
 invoice_number | amount 
----------------+--------
              1 |     10
              2 |     11
              2 |     11
              3 |     99
(4 rows)

rcashell=# DELETE FROM invoices WHERE ctid IN (SELECT MAX(ctid) FROM invoices GROUP BY invoice_number HAVING COUNT(*) > 1);
DELETE 1
rcashell=# select * from invoices;
 invoice_number | amount 
----------------+--------
              1 |     10
              2 |     11
              3 |     99
(3 rows)

Oracle

We can use a similar method as was used with PostgreSQL with the use of the ROWID which is similar to CTID in postgreSQL as this identifies the datafile, the data block and position within the block.

DELETE FROM invoices WHERE ROWID IN (SELECT MAX(ROWID) FROM invoices GROUP BY invoice_number HAVING COUNT(*) > 1);

Let’s look at this in action:

SQL> create table invoices (invoice_number integer, amount integer);

Table created.

SQL> insert into invoices(invoice_number, amount) values (1,10);

1 row created.

SQL> insert into invoices(invoice_number, amount) values (2,11);

1 row created.

SQL> insert into invoices(invoice_number, amount) values (2,11);

1 row created.

SQL> insert into invoices(invoice_number, amount) values (3,99);

1 row created.

SQL> select * from invoices;    

INVOICE_NUMBER	   AMOUNT
-------------- ----------
	     1	       10
	     2	       11
	     2	       11
	     3	       99

SQL> DELETE FROM invoices WHERE ROWID IN (SELECT MAX(ROWID) FROM invoices GROUP BY invoice_number HAVING COUNT(*) > 1);

1 row deleted.

SQL> select * from invoices;

INVOICE_NUMBER	   AMOUNT
-------------- ----------
	     1	       10
	     2	       11
	     3	       99