I have a table with a primary id column (automatically indexed), two sub-id columns (also indexed), and 12 columns of type geometry(Polygon). If I am going to update ~2 million rows on this table, is it faster for me to run ~2 million individual update statements like
update TABLE_NAME set ( COLUMNS ) = ( VALUES ) where ID_COLUMN = NEXT_IDor is it faster to do some smaller number of larger update statements like in this answer
update TABLE_NAME as update_t set
COLUMNS = new_vals.COLUMNS
from (values
(id, polygon1val, polygon2val, ... polygon12val), /* row 1 */
(id, polygon1val, polygon2val, ... polygon12val), /* row 2 */
... /* ... */
(id, polygon1val, polygon2val, ... polygon12val) /* row N */
) as new_vals( COLUMNS )
where new_vals.id = update_t.idIf the latter, do you have any suggestions on what a good N might be? Is N = ~2mil, or some smaller subset (that I would repeat until they're all done)?
EDIT: Obviously, in the former case I would use a prepared statement. But I also wonder, in the latter case, is there any benefit in trying to use a prepared statement?
I'm using PostgreSQL 9.2.
11 Answer
In general, the better you can batch operations into sets the more options the database has of making things fast. If you run the updates individually, the only option is something like "locate the one row affected, delete it, insert new one"
If you can batch the updates then the planner gets to decide whether a sequential scan may be faster than a bunch of index scans (and it may well be, since you get to leverage read-ahead caching). In other words, one command updating a lot of rows almost always performs better than a lot of commands updating a single row, even aside from planning overhead.
3