UPDC (USPTO Patents Database Construction)

Configuration

We strongly recommend using a Linux server to build the database and run the scripts below, which download, parse, and populate all the data. The required configuration is listed below:

OS: Linux/Unix, Windows, or Mac OS X
Python Version: 2.6
External Module: MySQLdb
Database: MySQL 5.0 or later

Program Execution

Build the database

Download the database schema file ‘/Patents_Model&EERDiagram_20120804.mwb’ and open it in MySQL Workbench, then use the menu [Database]-[Forward Engineer] to create the database.

Download the SQL script file ‘/createDatabaseSQL_0801.sql’ to a local path (filePath) and run it from the MySQL command line with the following command: “mysql> SOURCE filePath”

Run Python scripts

Download the compressed project file ‘USPTOPatentsDatabaseConstrucation.zip’ and extract it. Keep the original locations of all the folders (CLS, CSV_G, CSV_P, CSV_PAIR, ID, CSV_LOG, PAIR, PG_BD, PP_BD) and the files they contain.
Run GrantsParser.py to download, parse, and populate Patent Grants data into the database automatically. The parser obtains the download links for all patent grants through SourceParser.py, which checks the Google website. All the data are downloaded first, and the format of each package is then identified from its file name. The parser applies a format-specific function to read the data contained in each package, then extracts all the metadata and populates it into the database in a defined order.
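The file-name-based format detection described above might look roughly like the following. This is only an illustrative sketch, not the code in GrantsParser.py: the prefixes (ipg, pg, pftaps) follow common USPTO bulk-file naming but are assumptions here, and the function name identify_format and the format labels are hypothetical.

```python
import re

# Assumed file-name patterns and hypothetical format labels;
# neither is taken from GrantsParser.py.
FORMAT_PATTERNS = [
    ("xml4", re.compile(r"^ipg\d{6}\.zip$")),             # e.g. ipg120103.zip
    ("xml2", re.compile(r"^pg\d{6}\.zip$")),              # e.g. pg020101.zip
    ("aps", re.compile(r"^pftaps\d{8}_wk\d{2}\.zip$")),   # e.g. pftaps19760106_wk01.zip
]


def identify_format(link):
    """Return a format label for a download link, or None if unrecognised."""
    # Keep only the file name at the end of the link and ignore case.
    file_name = link.rsplit("/", 1)[-1].lower()
    for label, pattern in FORMAT_PATTERNS:
        if pattern.match(file_name):
            return label
    return None
```

A caller could then look up the matching parsing function by label and skip (or log) any link for which identify_format returns None.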
Run PublicationsParser.py to download, parse, and populate Patent Application Publications data into the database automatically. It uses the same processing strategy as GrantsParser.py: it downloads zip packages from the Google website and then parses them into the database.
Run ClassificationParser.py to parse and populate the Patent Classifications data stored in the ‘/CLS’ folder into the database automatically.
Run PAIRParserSeg.py to download, parse, and populate Patent Application Information Retrieval (PAIR) data into the database automatically. Terabytes of PAIR data are hosted on Google, so the running time depends on your network speed and server hardware. PAIRParserSeg.py first collects the download links for all the PAIR data and then divides them into segments. Because of network speed limits, the parser creates ten processes that handle 1000 packages at a time, and every package is deleted once its data has been extracted and loaded, to save space on your server.
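The segment-and-pool strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in PAIRParserSeg.py: the names split_into_segments, process_all, and handle_package are hypothetical, and only the two numbers (ten processes, 1000 packages per segment) come from the description.

```python
from multiprocessing import Pool

SEGMENT_SIZE = 1000   # packages per segment, as stated in the description
NUM_PROCESSES = 10    # worker processes, as stated in the description


def split_into_segments(links, size=SEGMENT_SIZE):
    """Split a list of links into consecutive segments of at most `size`."""
    return [links[i:i + size] for i in range(0, len(links), size)]


def process_all(links, handle_package, processes=NUM_PROCESSES):
    """Process the links one segment at a time with a pool of workers.

    `handle_package` is a hypothetical module-level function that downloads,
    parses, loads, and then deletes a single package.
    """
    results = []
    pool = Pool(processes)
    try:
        for segment in split_into_segments(links):
            # map() blocks until the whole segment has been processed,
            # so at most one segment is in flight at any time.
            results.extend(pool.map(handle_package, segment))
    finally:
        pool.close()
        pool.join()
    return results
```

Working segment by segment keeps the number of packages on disk bounded, provided handle_package deletes each package after loading it. On platforms that start worker processes by spawning rather than forking, process_all should be called from under an `if __name__ == "__main__":` guard.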
AutoUpdater.py checks for new updates and populates them into the database automatically. It finds new, unprocessed links by comparing all the download links on the Google website with the list of successfully processed files in the LOG file. The new links are then passed to the appropriate parsers for processing. We strongly recommend running this updater once a week, because the USPTO data on Google are updated weekly (always on Tuesday).
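The comparison step described above amounts to a set difference between the available links and the processed files. The sketch below illustrates the idea only. It assumes a LOG file with one processed file name per line, which may not match the actual log layout, and the function names read_processed and find_new_links are hypothetical.

```python
def read_processed(log_path):
    """Read the names of successfully processed files.

    Assumes one file name per line; the real LOG layout may differ.
    """
    processed = set()
    with open(log_path) as log_file:
        for line in log_file:
            name = line.strip()
            if name:
                processed.add(name)
    return processed


def find_new_links(all_links, processed):
    """Return links whose file name is not in `processed`, keeping order."""
    new_links = []
    for link in all_links:
        file_name = link.rsplit("/", 1)[-1]
        if file_name not in processed:
            new_links.append(link)
    return new_links
```

Comparing by file name rather than by full link means a package is still recognised as processed if its host or directory changes, at the cost of assuming that file names are unique.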

Loading all the .csv files

Run ‘MySQLLoader.py’ to load all the .csv files into the database.