
Fetching millions of rows in Python w/ psycopg2

I ran into an issue with psycopg2 getting "stuck" while querying a large result set. cursor.execute("...") would run fine, and so would cursor.fetchone(), but cursor.fetchmany(BATCH_SIZE) with a moderate value of BATCH_SIZE=100 would get stuck.

It seemed to take an awfully long time to fetch just 100 rows of the result set. The query wasn't complex, so it should have executed quickly. Looking at the RDS logs, everything looked fine.

Turned out that psycopg2 fetches the entire result set to the client by default, unless you use a named cursor (a server-side cursor). With a named cursor, the result set is kept on the server, letting you fetch rows as needed. When iterating over a named cursor, you can also control the number of records fetched over the network in each round trip by setting the cursor.itersize property to a reasonable value (default 2000).

#!/usr/bin/python
import psycopg2


def main():  
    conn_url = 'postgresql://{username}:{password}@{host}:{port}/{dbname}'.format(
        username='',
        password='',
        host='',
        port='',
        dbname='')

    conn = psycopg2.connect(conn_url)
    cursor = conn.cursor(name='fetch_large_result')
    cursor.execute('SELECT * FROM <large_table>')

    while True:
        # consume result over a series of iterations
        # with each iteration fetching 2000 records
        records = cursor.fetchmany(size=2000)

        if not records:
            break

        for r in records:
            # do something with each record here
            pass

    cursor.close()  # don't forget to clean up
    conn.close()

if __name__ == "__main__":  
    main()