Submitting Large Requests

If you plan to search for or retrieve more than a thousand data sets, the suggestions below may help your requests run more smoothly. If you do run into problems, please contact the STScI help desk at archive@stsci.edu and describe the problem; we may be able to accommodate special requests.

The information below applies to all MAST missions (accessed via archive.stsci.edu), and describes requests submitted from both a MAST search form and an HTTP GET request. See the MAST Services page for more information on submitting GET requests. For more information on ways to submit HST requests and HST-specific restrictions, see the HST bigsearch_request web page.
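
As an illustration, the Python sketch below issues a simple GET request to the Kepler data search service. It is only a sketch: the endpoint path, target name, and parameter values are example assumptions, and the authoritative parameter list is on the MAST Services page.

    import requests

    # Query the Kepler data search service via HTTP GET.
    # Adjust the mission path in the URL for other archives.
    url = "http://archive.stsci.edu/kepler/data_search/search.php"
    params = {
        "action": "Search",     # tells the script to execute the query
        "target": "kepler-8",   # example target name (an assumption)
        "outputformat": "CSV",  # non-HTML output; see the sections below
    }

    response = requests.get(url, params=params, timeout=300)  # 5-minute server limit
    response.raise_for_status()
    print(response.text)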

Request Limits

First, it is important to distinguish between search requests and data requests. "Search requests" refers to querying a database table and displaying the results. "Data requests" refers to submitting requests to have data retrieved from the archive and copied to a medium or location of your choice. Generally, search requests are executed using scripts written in PHP, while data retrieval scripts are written in Perl. Although we have been increasing our limits (mainly for the Kepler project), several different restrictions affect the allowed (and practical) size of both kinds of request:

  1. Execution Time Limit: Currently requests have a 5-minute execution time limit, set by both our Apache web server and our PHP scripting language. Perl scripts have no execution time limit of their own, but users are still subject to the 5-minute Apache limit. Requests taking more than 5 minutes should be divided into smaller requests.
  2. Data Search Limits: In theory, users can submit queries that will find up to 50 million rows; by default, however, the standard search forms only display the first 10,001 rows. Using the "max records" search form element, the number displayed may be increased up to 25,001. For the file upload forms, the current default is to retrieve 20 records per target, but the limit can be increased up to 5,000 records per target. For HTTP GET requests, the max_records parameter defaults to 2,000 records but may be set to any value (see the sketch following this list). The more entries returned, however, the longer the execution time and the more memory required by the browser if results are displayed in HTML. This can affect JavaScript features such as sorting and paging results.
  3. Memory Limits: As mentioned above, large requests can hit memory limits on both the MAST web server AND the user's web browser. Server limits depend on the number of concurrent users and are hard to quantify. Browser limits depend on the amount of memory allocated for programs, such as the JavaScript used to sort or page search results. Safari in particular is known to reserve less memory for JavaScript than browsers such as Firefox or Chrome, so users may want to try another browser if problems occur.
  4. Data Retrieval Limit: Even though all the displayed search result entries can be submitted for data retrieval, there is a limit on the number of data sets that can reasonably be submitted in one request. Initial tests with retrieving Kepler light curves indicate that requests for up to 10,000 data sets are possible; in one test, staging roughly 11,000 light curves took about 8 hours. The staff, however, are still investigating this limit.
  5. File Upload Limits: File uploads can include up to 10,000 entries. For Data IDs (or Kepler IDs), the results will be written to a single output table, allowing all returned data to be retrieved simply by marking all rows and clicking the Submit button. For file uploads containing target names or coordinates, each entry is treated as a separate search request, and each set of results has to be requested for retrieval individually. As mentioned above, the default is 20 records per target, with an option to increase this to 5,000 records per target.
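
As a concrete illustration of the GET-request limits in items 1 and 2, the hedged Python sketch below raises max_records above its 2,000-record default and sets a client-side timeout matching the 5-minute Apache limit. The coordinate and radius values (and the assumption that the radius is in arcminutes) are placeholders; check the search form help for the exact conventions.

    import requests

    url = "http://archive.stsci.edu/kepler/data_search/search.php"
    params = {
        "action": "Search",
        "ra": "19:22:40", "dec": "+47:58:10",  # placeholder coordinates
        "radius": "30",                        # search radius (arcminutes assumed)
        "max_records": "25001",                # raise the 2,000-record GET default
        "outputformat": "CSV",
    }

    # Fail fast instead of hanging past the server's 5-minute execution limit.
    response = requests.get(url, params=params, timeout=300)
    response.raise_for_status()
    print(len(response.text.splitlines()), "lines returned")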

Improving Large Search Requests

  1. Reduce Number of Columns: The amount of memory used is directly related to the number of columns requested in the search results. A subset of the available columns is used by default, but the number can be further reduced (or increased, for that matter) using the available search form options. If you want to retrieve data, you must include at least the data set name and the mark columns. Depending on which default columns are removed, though, some links shown in the search results may not work. For GET requests, the default fields will be returned unless the selectedColumnsCSV parameter is used to list all the desired fields, comma-separated (see the sketch following this list).
  2. Reduce Number of Rows: This may be obvious, but the more rows found (and returned), the more memory and execution time a request needs. You may, for example, reduce the search radius for target name or coordinate searches, add more constraints, or narrow the limits for range searches.
  3. Use Alternate Output Formats: Results displayed in HTML require more formatting and browser memory than the other format options. Switching to CSV, VOTable, or another non-HTML output format can therefore reduce server execution time, and writing results directly to a file rather than the browser reduces client-side memory usage. Note, however, that HTML format is needed for data retrieval unless you can
    1. use the File: WGET option to download online files (currently available for Kepler, IUE, or FUSE only), or
    2. plan to download the search results and retrieve data via a file upload (see below).
  4. Use Sort Options: If you're looking for data with the highest or lowest value in a particular column, you may be able to reduce the number of downloaded entries by sorting the results on that column when you submit the request. Click "Reverse" next to the sort option to reverse the displayed order of rows.
  5. Query Indexed Fields: Some fields are indexed to make queries on those fields run faster. The columns written in italics in the Field Description pages are indexed. For example, the KIC help page at search_fields?mission=kic10 shows that Kepler_id, ra, and Dec are all indexed fields. It is possible to index additional fields, so it is worth contacting the help desk when slow queries cause problems.
  6. Use Multiple Search Requests: The last resort is to break a large request into a series of smaller ones in order to stay within the limits mentioned above. The sketch following this list combines this approach with items 1 and 3.
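
To make items 1, 3, and 6 concrete, the sketch below requests only two columns via selectedColumnsCSV, writes CSV output straight to a file, and splits one large query into several smaller right-ascension slices. The column names and the ".." range syntax are assumptions based on the Kepler field descriptions and the search form conventions; verify both for your mission before relying on them.

    import requests

    url = "http://archive.stsci.edu/kepler/data_search/search.php"
    base = {
        "action": "Search",
        "outputformat": "CSV",  # item 3: lighter-weight than HTML
        "selectedColumnsCSV": "ktc_kepler_id,sci_data_set_name",  # item 1 (assumed names)
        "max_records": "25001",
    }

    # Item 6: break one large query into smaller right-ascension slices.
    with open("results.csv", "w") as out:
        for ra_lo, ra_hi in [(280, 285), (285, 290), (290, 295)]:
            params = dict(base, ra=f"{ra_lo} .. {ra_hi}")  # range syntax assumed
            response = requests.get(url, params=params, timeout=300)
            response.raise_for_status()
            out.write(response.text)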

Improving Large Data Requests

  1. Wait for Search Results: When retrieving data from a search request, it is important to wait until ALL the results are displayed before submitting a data request. The search results begin to display before the script has finished executing, but the "submit data" option will not work until the script has completed. A good way to tell whether the search script is done is to look for the "Done" message at the bottom of the browser window (as in Firefox), or for the phrase "Page 1 of n" at the beginning of the search results; the number "n" will be blank until the script has completed. Another indicator is the striping of alternate rows, which is one of the last actions performed.
  2. File Uploads directly to Data Request page: It is possible to submit file uploads directly to the data retrieval interface at:
    http://archive.stsci.edu/cgi-bin/kepler/dataset_lookup for Kepler,
    http://archive.stsci.edu/cgi-bin/dataset_lookup for HST, or
    http://archive.stsci.edu/cgi-bin/fuse/dataset_lookup for FUSE.
    The uploaded file must be a list of data set names, and the data retrieval limit mentioned above is still in effect. The uploaded file can be created by saving the results of a search request with an output format such as "File: comma-separated values".
  3. Submit Multiple Requests: As with search requests, the last resort is to break a large request into a series of smaller ones in order to stay within the limits mentioned above (see the sketch below).
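
If you need to split a long data set list into multiple uploads or requests, a small Python sketch like the one below will do it. The 10,000-entry batch size mirrors the file upload limit described earlier; the file names are otherwise assumptions.

    # Split a long list of data set names into files that each stay
    # within the 10,000-entry file upload limit.
    BATCH = 10000

    with open("all_datasets.txt") as f:  # one data set name per line
        names = [line.strip() for line in f if line.strip()]

    for i in range(0, len(names), BATCH):
        with open(f"upload_batch_{i // BATCH + 1}.txt", "w") as out:
            out.write("\n".join(names[i:i + BATCH]) + "\n")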