Submitting Large Requests

If you are interested in searching for or retrieving more than several thousand data sets, here are some suggestions that may help your requests run more smoothly. If you do run into problems, please contact the STScI help desk at archive@stsci.edu and explain the problem. We may be able to accommodate special requests.

The information below applies to all MAST missions (accessed via archive.stsci.edu), and describes requests submitted from either a MAST search form or an HTTP GET request. See the MAST Services page for more information on submitting GET requests. For more information on ways to submit HST requests and HST-specific restrictions, see the HST bigsearch_request web page.

Request Limits

First, it is important to distinguish between search requests and data requests. "Search Requests" refers to querying a database table and displaying the results. "Data Requests" refers to submitting requests to have data retrieved from the archive and copied to a medium or location of your choice. Generally, the search requests are executed using scripts written in PHP, while the data retrieval scripts are written in Perl. Although we have been increasing our limits (mainly for the Kepler and the Hubble Source Catalog projects), there are several different restrictions that affect the allowed (and practical) size of both kinds of request:

  1. Data Search Limits: In theory, users can submit queries that find up to 100 million rows. By default, however, the standard search forms display only the first 5,000 rows. Using the "max records" search form element, the number displayed may be increased up to 50,000. For the file upload forms, the current default is to retrieve 20 records per target, but the limit can be increased up to 5,000 records per target. For HTTP GET requests, the max_records parameter defaults to 2,000 records, but may be set to any value (< 100 million). The more entries returned, however, the longer the execution time, the more data transmitted to the user, and the more memory required by the browser if results are displayed in HTML. The latter could affect JavaScript features such as sorting and paging results.
  2. Memory Limits: As mentioned above, large requests can hit memory limits on both the MAST web server and the user's web browser. Server limits depend on the number of concurrent users and are hard to quantify. Browser limits depend on the amount of memory allocated to programs, such as the JavaScript used for sorting or paging search results. Safari in particular is known to reserve less memory for JavaScript than browsers such as Firefox or Chrome, so users may want to try other browsers if problems occur.
  3. Data Retrieval Limit: Even though all the displayed search result entries can be submitted for data retrieval, there is a limit on the number of data sets that can be reasonably submitted in one request. Initial tests with retrieving Kepler light curves indicate requests for up to 10,000 data sets are possible. In one test, staging roughly 11,000 light curves took about 8 hours. The staff however are still investigating this limit.
  4. File Upload Limits: File uploads can include up to 10,000 entries. For Data IDs (or Kepler IDs), the results will be written to a single output table, allowing all returned data to be retrieved simply by marking all rows and clicking the Submit button. For file uploads containing Target names or Coordinates, each entry is treated as a separate search request, and each set of results has to be requested for retrieval individually. As mentioned above, the default is 20 records per target, with an option to increase this to 5,000 records per target.
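
The record limits above can be summarized in a short Python sketch. This is only an illustration of the stated numbers, not part of any MAST software: the dictionary keys and the function name effective_max_records are hypothetical, and the GET ceiling of 99,999,999 is an assumed reading of "< 100 million".

```python
# Record limits as stated in the list above.
# The interface names used as keys are illustrative, not MAST identifiers.
LIMITS = {
    "search_form": {"default": 5_000, "maximum": 50_000},
    "file_upload": {"default": 20, "maximum": 5_000},        # per target
    "http_get": {"default": 2_000, "maximum": 99_999_999},   # assumed reading of "< 100 million"
}


def effective_max_records(interface, requested=None):
    """Return the number of records an interface would allow.

    With no requested value the documented default applies; otherwise
    the requested value is capped at the documented maximum.
    """
    limits = LIMITS[interface]
    if requested is None:
        return limits["default"]
    if requested < 1:
        raise ValueError("requested must be a positive integer")
    return min(requested, limits["maximum"])


print(effective_max_records("search_form"))          # default for search forms
print(effective_max_records("search_form", 80_000))  # capped at the form maximum
print(effective_max_records("file_upload", 1_000))   # within the per-target limit
```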

Improving Large Search Requests

  1. Reduce Number of Columns: The amount of memory used is directly related to the number of columns requested in the search results. A subset of the available columns is used by default, but the number can be further reduced (or increased, for that matter) using the available search form options. If you want to retrieve data, you need to include at least the data set name and the mark columns. Depending on which default columns are removed, some links shown in the search results may no longer work. For GET requests, the default fields will be returned unless the selectedColumnsCSV parameter is used and all the desired fields are given in a comma-separated list.
  2. Remove Mark Button: The mark button is needed for retrieving data and is normally added in mission searches, but this can add a significant amount of execution time. For example, a Kepler request to find all false positives (currently 36,217 targets) took about 90 seconds with the Mark button, and 23 seconds without it. Skipping the formatting (see below) further reduced this time to about 12 seconds. So, if you don't need to retrieve data, you may want to remove the Mark button from the output columns. Note the Mark button is only used for MAST mission searches when results are displayed in HTML.
  3. Reduce Number of Rows: This may be obvious, but the more rows found (and returned), the more memory and execution time are required. You may, for example, try reducing the search radius for target name or coordinate searches, adding more constraints, or narrowing the limits for range searches.
  4. Skip Formatting: An option now exists to skip the formatting of the search results after the query is completed. In some cases, skipping this step can reduce execution times by a factor of 2. (Note this step applies to ALL output formats.)
  5. Remove Null Columns: A new option exists to allow users to remove columns from the search results which contain all null entries. This can be useful when querying sparsely populated tables like the Hubble Source Catalog. Although some time is needed to calculate the null columns, if a large number are removed, the request may even run faster. Note columns containing zeroes are not removed.
  6. Use Alternate Output Formats: Results displayed in HTML require more formatting and browser memory than other format options. Switching to CSV, VOTable, or other non-HTML output formats could significantly reduce execution times, and writing results directly to a file rather than the browser would reduce client-side memory usage. In one example, a 47 second HTML request ran in 4 seconds using CSV output. Note however that HTML format may be needed for data retrieval unless you can:
    1. use the File: WGET option to download online files (currently available for Kepler, IUE, or FUSE only), or
    2. plan to download the search results and retrieve data via a file upload (see below).
  7. Use Sort Options: If you're looking for data with the highest or lowest values for a particular column, you may be able to reduce the number of downloaded entries by sorting the results on that column when you submit the request. Click "Reverse" next to the sort option to reverse the displayed order of rows (i.e., from highest to lowest).
  8. Query Indexed Fields: Some fields are indexed to make queries on those fields run faster. The columns written in italics in the Field Description pages are indexed. For example, from the KIC help page at: search_fields?mission=kic10 you can see that Kepler_id, ra and Dec are all indexed fields. It's possible to index additional fields, so it's important to contact the help desk when problems occur.
  9. Use Multiple Search Requests: The last resort is to break large requests into a series of smaller ones in order to stay within the limits mentioned above.
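
As a rough illustration of several of these suggestions for GET requests, the Python sketch below builds a search URL that limits the columns, sets max_records, and asks for CSV output. Only max_records and selectedColumnsCSV are named on this page. The base URL, the outputformat and action parameters, the column names, and the constraint syntax are all assumptions made for the example, so check the MAST Services page for the parameters a given mission actually accepts.

```python
from urllib.parse import urlencode

# Assumed location of a mission search script; not confirmed by this page.
BASE_URL = "https://archive.stsci.edu/kepler/data_search/search.php"


def build_search_url(constraints, columns, max_records=2000, base_url=BASE_URL):
    """Build a GET URL with a reduced column list and non-HTML output."""
    params = dict(constraints)                        # search constraints, passed through as given
    params["max_records"] = max_records               # parameter named on this page
    params["selectedColumnsCSV"] = ",".join(columns)  # parameter named on this page
    params["outputformat"] = "CSV"                    # assumed parameter name and value
    params["action"] = "Search"                       # assumed parameter name and value
    return base_url + "?" + urlencode(params)


# Hypothetical constraint and column names, for illustration only.
url = build_search_url(
    {"kic_teff": "8040..8050"},
    ["kic_kepler_id", "kic_teff"],
    max_records=5000,
)
print(url)
```

Requesting only the columns you need and a non-HTML format addresses suggestions 1 and 6 at the same time. The URL is only constructed here, not fetched.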

Improving Large Data Requests

  1. Wait for Search Results: When retrieving data from a search request, it's important to wait until ALL the results are displayed before submitting a data request. The search results will begin to be displayed before the script has completed execution, but the "submit data" option will not work until the script has completed. A good way to tell if the search script is done is to either look for the "Done" message at the bottom of the browser window (as in Firefox), or look for the phrase "Page 1 of n" at the beginning of the search results. The number "n" will be blank until the script has completed. Another indicator is the striping of alternate rows, which is one of the last actions performed.
  2. File Uploads directly to Data Request page: It is possible to submit file uploads directly to the data retrieval interface at:
    https://archive.stsci.edu/cgi-bin/kepler/dataset_lookup for Kepler,
    https://archive.stsci.edu/cgi-bin/dataset_lookup for HST, or
    https://archive.stsci.edu/cgi-bin/fuse/dataset_lookup for FUSE.
    The uploaded file must be a list of data set names, and the data retrieval limit mentioned above is still in effect. The uploaded file could be created by saving results from a search request with an output format such as "File: comma-separated values".
  3. Submit Multiple Requests: As for data search requests, the last resort is to break large requests into a series of smaller ones in order to stay within the limits mentioned above.
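
One way to prepare multiple requests is to split a long list of data set names into upload files of at most 10,000 entries each. The Python sketch below is a minimal example under stated assumptions: the function names and file prefix are hypothetical, and writing one data set name per line is an assumed file layout, since this page only says the file must be a list of data set names.

```python
from pathlib import Path


def split_into_batches(names, batch_size=10_000):
    """Split a list of data set names into batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [names[i:i + batch_size] for i in range(0, len(names), batch_size)]


def write_upload_files(names, out_dir, batch_size=10_000, prefix="upload_part"):
    """Write each batch to its own text file, one name per line (assumed layout)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, batch in enumerate(split_into_batches(names, batch_size), start=1):
        path = out_dir / f"{prefix}_{number:03d}.txt"
        path.write_text("\n".join(batch) + "\n")
        paths.append(path)
    return paths
```

Each resulting file can then be uploaded as a separate request. A smaller batch_size may be preferable given the staging times noted above.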