| This page is user contributed documentation. See the bottom of the page for information about the author. | |
PostgreSQL's "COPY" statement is an excellent way to load large amounts of data quickly into a database.
The main PostgreSQL documentation is the definitive guide, and I recommend studying it.
The documentation is quite easy to understand, once you know it
:-). If you are like me, it probably looks a little
forbidding at first. I hope the following notes, based
on my (sometimes painful) experiences
with COPY, will help make your experience with this useful utility
more pleasant and rewarding.
Here is the syntax for COPY, from the main docs:
COPY [ BINARY ] table [ WITH OIDS ]
FROM { 'filename' | stdin }
[ [USING] DELIMITERS 'delimiter' ]
[ WITH NULL AS 'null string' ]
COPY [ BINARY ] table [ WITH OIDS ]
TO { 'filename' | stdout }
[ [USING] DELIMITERS 'delimiter' ]
[ WITH NULL AS 'null string' ]
COPY will be run by the PostgreSQL backend (user "postgres"). The backend user requires permissions to read & write to the data file in order to copy from/to it, and needs to be able to find it.
Therefore you need to use an absolute pathname. You will also need to have insert/update or select permission on the table in order to COPY to or from it.
Assuming these requirements are met, you are ready to happily
populate your database by doing:
COPY bleah FROM 'blah';
from the psql prompt.
TIP: Try creating a test table, inserting some data like that which you want to import. Then COPY the content to a file and inspect the results. This will tell you a lot about what COPY wants.
COPY is not terribly smart. All it really does is split each line using the appropriate delimiters, and attempt to line up each field in your data with each field in the table, counting from the left. If the parser accepts your data, in it goes. No parsing is done by COPY itself, so the data is handed off in just the state you have it.
I find problems usually fall into the following (somewhat overlapping) categories:
You might be amazed (or maybe not) at the kinds of things that DB users are capable of inserting into your data. Be prepared to do some editing to get your data accepted by COPY (more on this later). Here are some examples of the kinds of problems that I have encountered:
If your data file is large, using an interactive editor is not a good idea. Most times, your edits will require some scripting. If you are not already familiar with scripting in tools like sh, sed, awk or perl, time spent on learning will be rewarded. You can do a lot with a little knowledge. Coverage of these methods is beyond the scope of this little article (you probably know more than I do anyway!).
Carriage returns
You can avoid these altogether by moving your data files via FTP,
using the "ascii" transfer method. This will automatically adjust
the line endings for you.
You can just delete the carriage returns using a simple script: "
tr -d '\r' < datafile " should work just fine.
I like to use GNU recode, which has lots of other nifty functionality as well. The command for that is recode /cl datafile. There are lots of other possible methods, mainly similar to these. Choose according to your taste.
My terminal emulator (SecureCRT) allows me to copy and paste quite large amounts of data directly from my Windows workstation, using "COPY FROM stdin;". It even renders MS Excel cells into lines of tab delimited text, with appropriate line endings! If you are using a terminal emulator, you may want to investigate its capabilities.
Rogue characters
Your choice is to delete them, perhaps with an innocuous
replacement, or to escape them. This requires inserting
a backslash before each offending character. If you
really want to keep your data just the way
it came to you (like I usually do), this is the way to
go. Note that this is still subject to the limits
imposed by the applicable PostgreSQL data type.
Others
Other situations, such as embedded delimiters or incorrect data
formats, require that you find some pattern in your data which
will allow you to identify precisely what needs to be changed.
i.e. You will need to delete the commas in "1,203,327", but you probably don't want to delete all of the commas in the whole line.
Today's popular scripting languages offer good methods for doing this kind of thing.
You can also use the psql client version of COPY, which is "\copy". The syntax is slightly different:
Generally I find \copy works well. The only quirk I have found is it sometimes fails to return an error message that otherwise you would get from COPY; and sometimes will hang in that case as well. Check the result carefully in this case.
The behaviour of COPY is not symmetrical. If you COPY data into a table already containing data, the new data will be appended.
If you COPY TO a file already containing data, the
existing data will be
overwritten.
If you are doing repeated tests of your COPY, be careful you do not duplicate the data in your table.
The quickest way to clean out a table after a test is "TRUNCATE tablename;".
Default values are not filled in by COPY, as an INSERT would
do. You will need to explicitly supply a value for
every field. If you are using a "serial" type field,
you will need to reserve some numbers by resetting the sequence
value:
SELECT setval ('sequence_name', new_value);
You can then insert the reserved numbers into your data file, and
COPY away.
Indexes on your table will slow down your COPY too, perhaps by an order of magnitude. For a large data file, you may be better off dropping your indexes and recreating them afterwards. A script is an easy way to do this.
Because each COPY runs as a single transaction, an extremely large data file might lead to performance problems. If in doubt, just split your file into chunks.
When editing, I often like to do a one-shot, like this:
sed -e 'edit_script' data_file | psql -c " copy mydata from stdin with null as '' " -d mydb&
Check the man page of psql for more information.