Skip site navigation (1) Skip section navigation (2)

Peripheral Links

Header And Logo

PostgreSQL
| The world's most advanced open source database.

Site Navigation

This page is user contributed documentation. See the bottom of the page for information about the author.  

Using PostgreSQL's COPY function effectively

Using PostgreSQL's COPY function effectively

Last updated 17th September 2001

Introduction

PostgreSQL's "COPY" statement is an excellent way to load large amounts of data quickly into a database.

The main PostgreSQL documentation is the definitive guide, and I recommend studying it.

The documentation is quite easy to understand, once you know it :-).  If you are like me, it probably looks a little forbidding at first.  I hope the following notes, based on my (sometimes painful) experiences

with COPY, will help make your experience with this useful utility more pleasant and rewarding.

Basic basics

Here is the syntax for COPY, from the main docs:

COPY [ BINARY ] table [ WITH OIDS ]

    FROM { 'filename' | stdin }

    [ [USING] DELIMITERS 'delimiter' ]

    [ WITH NULL AS 'null string' ]

COPY [ BINARY ] table [ WITH OIDS ]

    TO { 'filename' | stdout }

    [ [USING] DELIMITERS 'delimiter' ]

    [ WITH NULL AS 'null string' ]

COPY will be run by the PostgreSQL backend (user "postgres").  The backend user requires permissions to read & write to the data file in order to copy from/to it, and needs to be able to find it.

Therefore you need to use an absolute pathname.  You will also need to have insert/update or select permission on the table in order to COPY to or from it.

Assuming these requirements are met, you are ready to happily populate your database by doing:



COPY bleah FROM 'blah';



from the psql prompt.

TIP: Try creating a test table, inserting some data like that which you want to import.  Then COPY the content to a file and inspect the results.  This will tell you a lot about what COPY wants.

Problems

COPY is not terribly smart.  All it really does is split each line using the appropriate delimiters, and attempt to line up each field in your data with each field in the table, counting from the left.  If the parser accepts your data, in it goes.  No parsing is done by COPY itself, so the data is handed off in just the state you have it.

I find problems usually fall into the following (somewhat overlapping) categories:

  • Data type mismatches
  • NULL confusion
  • Rogue characters

You might be amazed (or maybe not) at the kinds of things that DB users are capable of inserting into your data.  Be prepared to do some editing to get your data accepted by COPY (more on this later).  Here are some examples of the kinds of problems that I have encountered:

  • Embedded delimiters

    This is all too common, especially with user-created memo text.  COPY expects tabs as delimiters by default, but you can specify something else with "USING DELIMITERS 'whatever'".  If you have extra delimiter characters, COPY will find too many fields to fit your table, and.... you can guess the rest.  Most often this will show as a data type mismatch, as COPY attempts to stuff your text string into a date field or something similar.



  • Backslash characters

    I wonder what would possess someone to terminate their data entry with a backslash?  It really happens!

    This means trouble, because the following delimiter is thereby escaped, and no longer recognized as a delimiter.  All the fields get shifted down by one, with the result that (a) your COPY fails because of a data type mismatch, or (b) your data is silently accepted in a mangled state.  Either way, not good.



  • Non-printable characters

    Difficult to catch, because you can't see them.  Something to watch out for especially if your file is in fixed width format, as this means spaces or tabs have been used to line up your data.  These will be happily accepted by the parser for text type fields, but not for number or date types.



    If you are importing from Windows, be aware that some Windows-based software will accept spaces into number or date fields, and these may exist in your file.



    Other non-printable characters can be even worse, because they don't take up space, and are therefore practically invisible. If your COPY fails for this reason, you can find them by doing something like this:



    grep 'search pattern' datafile | vis -lw



    This will show you every non-printable character in the affected lines.



  • Carriage return characters

    If your data file was created on, or passed via a Windows machine, chances are that your lines are terminated by CR/LF combinations.  If the CR is not removed, it will end up in the final field of your table. If that field is a number or date data type, your COPY will fail.  If the field is a character data type, your COPY will succeed, and you will then be scratching your head trying to work out why comparisons using that field give such strange results.



  • NULL confusion

    COPY expects NULLs to be represented as "\N" (backslash-N) by default.  You can change this by using "WITH NULL AS 'something_else'".  If you have empty fields in your data, I strongly recommend using "WITH NULL AS '' ".  Otherwise, COPY will assume empty fields represent empty strings, and will bomb on the first empty number or date field.  Note that you cannot mix NULL representations!



  • Just the wrong data format

    Your data must match the format required by the relevant PostgreSQL data type.  See the docs.



    For example, if your integer looks like this:



    1,203,327



    it will not be accepted.  The commas will need to be edited out.  PostgreSQL recognizes a variety of date/time formats.  If you are unsure about yours, test a sample.

Fixing Problems

If your data file is large, using an interactive editor is not a good idea.  Most times, your edits will require some scripting.  If you are not already familiar with scripting in tools like sh, sed, awk or perl, time spent on learning will be rewarded.  You can do a lot with a little knowledge.  Coverage of these methods is beyond the scope of this little article (you probably know more than I do anyway!).



Carriage returns

You can avoid these altogether by moving your data files via FTP, using the "ascii" transfer method. This will automatically adjust the line endings for you.



You can just delete the carriage returns using a simple script: " tr -d '\r' < datafile " should work just fine.

I like to use GNU recode, which has lots of other nifty functionality as well. The command for that is recode /cl datafile.  There are lots of other possible methods, mainly similar to these.  Choose according to your taste.

My terminal emulator (SecureCRT) allows me to copy and paste quite large amounts of data directly from my Windows workstation, using "COPY FROM stdin;".  It even renders MS Excel cells into lines of tab delimited text, with appropriate line endings!  If you are using a terminal emulator, you may want to investigate its capabilities.

Rogue characters



Your choice is to delete them, perhaps with an innocuous replacement, or to escape them.  This requires inserting a backslash before each offending character.  If you really want to keep your data just the way

it came to you (like I usually do), this is the way to go.  Note that this is still subject to the limits imposed by the applicable PostgreSQL data type.

Others



Other situations, such as embedded delimiters or incorrect data formats, require that you find some pattern in your data which will allow you to identify precisely what needs to be changed.

i.e. You will need to delete the commas in "1,203,327", but you probably don't want to delete all of the commas in the whole line.

Today's popular scripting languages offer good methods for doing this kind of thing.

Other Bits & Quirks

You can also use the psql client version of COPY, which is "\copy".  The syntax is slightly different:

  1. Being a psql command, it is not terminated by a semicolon
  2. File paths are relative to your user directory (and need not be quoted)
  3. You cannot specify alternative delimiters.

Generally I find \copy works well.  The only quirk I have found is it sometimes fails to return an error message that otherwise you would get from COPY; and sometimes will hang in that case as well.  Check the result carefully in this case.

The behaviour of COPY is not symmetrical.  If you COPY data into a table already containing data, the new data will be appended.

If you COPY TO a file already containing data, the existing data will be

overwritten.

If you are doing repeated tests of your COPY, be careful you do not duplicate the data in your table.

The quickest way to clean out a table after a test is "TRUNCATE tablename;".

Default values are not filled in by COPY, as an INSERT would do.  You will need to explicitly supply a value for every field.  If you are using a "serial" type field, you will need to reserve some numbers by resetting the sequence value:



SELECT setval ('sequence_name', new_value);



You can then insert the reserved numbers into your data file, and COPY away.

Indexes on your table will slow down your COPY too, perhaps by an order of magnitude.  For a large data file, you may be better off dropping your indexes and recreating them afterwards.  A script is an easy way to do this.

Because each COPY runs as a single transaction, an extremely large data file might lead to performance problems.  If in doubt, just split your file into chunks.

When editing, I often like to do a one-shot, like this:

sed -e 'edit_script' data_file | psql -c " copy mydata from stdin with null as '' " -d mydb&

 

  • When loading a dump file generated from a different RDBMS, containing INSERT statements, you may be able to get it to load with little or no editing.



    However, there may be some difficulties with large dump files:



    • Each INSERT is a separate transaction, with BEGIN...COMMIT overhead.
    • Some INSERTs may fail, but the rest succeed, leaving you with

      problems syncronising your data.



    You could get around the first problem by splitting the file and

    surrounding the pieces with BEGIN...COMMIT statements.  Or you could try a different approach; edit the dump file to a delimited text file, and use COPY to import your data.



    If you want to try this, parse_inserts.pl is a Perl script which works for me.

Check the man page of psql for more information.



Jeff Eckermann
Updated: 2006-11-15 11:36
Author: Robert Treat
Long time PostgreSQL Contributor, Co-Author of Begining PHP and PostgreSQL 8
Version history:
  • 2006-04-06, Robert Treat
  • 2006-04-03, Robert Treat
Operations: Edit Create subpage (Requires community login)

Privacy Policy | Project hosted by hub.org | Designed by tinysofa
Copyright © 1996 – 2007 PostgreSQL Global Development Group