Configuration
We strongly recommend using a Linux server to build the database and to run the scripts described below, which download, parse, and populate all the data. The required configuration is listed below:
- OS: Linux/Unix, Windows, or Mac OS X
- Python version: 2.6 (download)
- External module: MySQLdb (download)
- Database: MySQL 5.0 or later (download)
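The requirements above can be checked before running anything. The following is a minimal sketch, not part of the project: the helper names `missing_modules` and `python_is_recent_enough` are hypothetical, and the minimum version simply mirrors the "Python 2.6" entry in the list.

```python
import sys

# Assumption: the minimum version mirrors the "Python 2.6" requirement above.
REQUIRED_PYTHON = (2, 6)
# MySQLdb is the external module named in the configuration list.
REQUIRED_MODULES = ["MySQLdb"]


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


def python_is_recent_enough(version_info=None, minimum=REQUIRED_PYTHON):
    """Compare a (major, minor, ...) version tuple against the minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)
```

Calling `missing_modules(REQUIRED_MODULES)` returns an empty list when MySQLdb is importable. Passing the version check on a much newer interpreter does not by itself mean the project scripts will run there, since they were written for Python 2.6.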
Programs Execution
- Build the database. Confirm that MySQL 5.0 or later is installed on your server or local OS. There are two ways to create the database:
- Download the database schema file ‘/Patents_Model&EERDiagram_20120804.mwb’, open it in MySQL Workbench, and use the menu [Database]-[Forward Engineer] to create the database; OR
- Download the SQL script file ‘/createDatabaseSQL_0801.sql’ to a path of your choice (filePath) and run it from the MySQL command line with the following command: “mysql> SOURCE filePath”
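The second option can also be driven from Python instead of an interactive MySQL session. This is a rough sketch under stated assumptions: the functions `build_source_command` and `run_sql_script` are hypothetical helpers, the `mysql` client is assumed to be on the PATH, and the user name and host are placeholders you would replace with your own.

```python
import subprocess


def build_source_command(sql_path, user, host="localhost"):
    """Build a mysql client command line that runs SOURCE on sql_path."""
    return [
        "mysql",
        "-h", host,
        "-u", user,
        "-p",                                # no value: the client prompts for the password
        "-e", "SOURCE {0}".format(sql_path),  # same statement as in the MySQL command line
    ]


def run_sql_script(sql_path, user, host="localhost"):
    """Run the script through the mysql client and return its exit code."""
    return subprocess.call(build_source_command(sql_path, user, host))
```

For example, `run_sql_script("/path/to/createDatabaseSQL_0801.sql", "root")` would prompt for the password and then execute the script; a non-zero return value indicates that the client reported an error.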
- Run the Python scripts. Confirm that Python 2.6 and the external Python module MySQLdb are installed on your server or local OS. The following steps are not strictly sequenced:
- Download the compressed project file ‘USPTOPatentsDatabaseConstrucation.zip’ and uncompress it. Keep all the downloaded folders (CLS, CSV_G, CSV_P, CSV_PAIR, ID, CSV_LOG, PAIR, PG_BD, PP_BD) and the files they contain in their original locations.
- Run GrantsParser.py to download, parse, and populate Patent Grants data into the database automatically. The parser obtains the download hyperlinks for all patent grants through SourceParser.py, which checks the Google website. All the data are downloaded first, and the format of each package is then identified from its file name. The parser applies a format-specific function to read the data contained in each package, then extracts the metadata and populates it into the database in a defined order.
- Run PublicationsParser.py to download, parse, and populate Patent Application Publications data into the database automatically. This parser uses the same processing strategy as GrantsParser.py: it downloads zip packages from the Google website and then parses them into the database.
- Run ClassificationParser.py to parse and populate the Patent Classifications data stored in the ‘/CLS’ folder into the database automatically.
- Run PAIRParserSeg.py to download, parse, and populate Patent Application Information Retrieval (PAIR) data into the database automatically. Terabytes of PAIR data are hosted by Google, so the running time depends on your network speed and server hardware. PAIRParserSeg.py first collects all the PAIR download hyperlinks and then divides them into many segments. Because of network speed limits, the parser creates ten processes that handle 1000 packages at a time, and each package is deleted after its data have been extracted and loaded, to save space on your server.
- AutoUpdater.py checks for new updates and populates them into the database automatically. It finds new, unprocessed hyperlinks by comparing all the download hyperlinks on the Google website with the list of successfully processed files in the LOG file. The new hyperlinks are then passed to the appropriate parsers for processing. We strongly recommend running this automatic updater once a week, because the USPTO data on Google are updated weekly (always on Tuesday).
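Two of the ideas described in the steps above, dividing a list of hyperlinks into segments (PAIRParserSeg.py) and finding hyperlinks that are not yet recorded in the log (AutoUpdater.py), can be illustrated with a short sketch. It is not the project's code: the function names are hypothetical, and it assumes that a link is matched to a log entry by the file name at the end of the URL, which the text above does not specify.

```python
def split_into_segments(links, segment_size=1000):
    """Split a list into consecutive segments of at most segment_size items."""
    if segment_size < 1:
        raise ValueError("segment_size must be at least 1")
    return [links[i:i + segment_size]
            for i in range(0, len(links), segment_size)]


def file_name_of(link):
    """Return the last path component of a hyperlink."""
    return link.rstrip("/").rsplit("/", 1)[-1]


def find_unprocessed(all_links, processed_files):
    """Return the links whose file name is not in the processed list.

    Assumption: the log stores bare file names, one per processed package.
    The original order of all_links is preserved.
    """
    done = set(processed_files)
    return [link for link in all_links if file_name_of(link) not in done]
```

With 2500 links and the default segment size, `split_into_segments` yields segments of 1000, 1000, and 500 links. Each segment could then be handed to a pool of worker processes (for instance `multiprocessing.Pool`), which is left out here to keep the sketch short.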
- Load all the .csv files. We also provide .csv files that can be loaded into the database much more easily. You only need to download all the .csv files and run ‘MySQLLoader.py’ to populate the data into your database. Note that after loading the .csv files, you also need to download the existing log files (LOG_G, LOG_P, LOG_PAIR) so that automatic updating stays accurate.
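For readers curious what bulk-loading .csv files into MySQL can look like, here is a hedged sketch. It does not reproduce MySQLLoader.py, whose internals are not described here. It assumes that each .csv file is named after its target table and is comma-separated with optional double-quote enclosure; both are assumptions, and all function names are hypothetical.

```python
import glob
import os


def table_name_for(csv_path):
    """Assumption: the table name equals the file name without its extension."""
    return os.path.splitext(os.path.basename(csv_path))[0]


def build_load_statement(csv_path, table=None):
    """Build a MySQL LOAD DATA statement for one .csv file."""
    if table is None:
        table = table_name_for(csv_path)
    path = csv_path.replace("\\", "/")  # MySQL accepts forward slashes on Windows
    return (
        "LOAD DATA LOCAL INFILE '{0}' INTO TABLE `{1}` "
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n'"
    ).format(path, table)


def statements_for_folder(folder):
    """Build one statement per .csv file in a folder, in sorted order."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    return [build_load_statement(path) for path in paths]
```

Each statement could be executed through a MySQLdb cursor or the `mysql` client. `LOAD DATA LOCAL` only works when local file loading is enabled on both the client and the server, and the simple string formatting above does not escape quotes in paths, so it is suitable only for trusted file names.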