Submitting Large Requests
If you are interested in searching for or retrieving more than
a thousand data sets, here are some suggestions that may help your requests
run more smoothly. If you do run into problems, please
contact the STScI help desk at archive@stsci.edu and describe the problem;
we may be able to accommodate special requests.
The information below applies to all MAST missions (accessed via archive.stsci.edu)
and describes requests submitted
both from a MAST search form and as an HTTP GET request. See the
MAST Services page for more information on
submitting GET requests. For more information on ways to submit HST requests
and HST-specific restrictions, see the HST
bigsearch_request
web page.
Request Limits
First, it is important to distinguish between search requests and data requests.
"Search requests" refers to querying a database table and displaying the results.
"Data requests" refers to submitting requests to have data retrieved from
the archive and copied to a medium or location of your choice. Generally,
search requests are executed by scripts written in PHP, while data retrieval
scripts are written in Perl. Although we
have been increasing our limits (mainly for the Kepler project), several
different restrictions affect the allowed (and practical) size
of both kinds of request:
- Execution Time Limit: Requests currently have a 5-minute execution time limit,
set both by our Apache web server and by the PHP scripting language. Perl scripts
have no execution time limit of their own, but users are still subject to the 5-minute Apache limit.
Requests taking more than 5 minutes should be divided into smaller requests.
- Data Search Limits: In theory, users can submit queries
that will find up to 50 million rows; however, by default,
the standard search forms only display the first 10,001 rows.
Using the "max records" search form element, the number
displayed may be increased up to 25,001. For the file upload forms, the
current default is to retrieve 20 records per target but the limit can be
increased up to 5,000 records per target. For
HTTP GET requests, the max_records parameter defaults to 2,000 records, but
may be set to any value. The more entries returned however,
the longer the execution times, and the more memory required by the browser
if results are displayed in HTML. This could affect javascript features such as
sorting and paging results.
- Memory Limits: As mentioned above, large requests can hit memory limits
on both the MAST web server AND the user's web browser. Server limits depend in part on the
number of concurrent users and are hard to quantify. Browser limits depend on the amount
of memory allocated for programs such as the JavaScript that supports sorting
and paging of search results. Safari in particular
is known to reserve less memory for JavaScript than browsers such as Firefox or Chrome,
so users may want to try another browser if problems occur.
- Data Retrieval Limit: Even though all the displayed search result entries can be
submitted for data retrieval, there is a limit on the number of data sets that can
reasonably be submitted in one request. Initial tests retrieving Kepler light curves
indicate that requests for up to 10,000 data sets are possible; in one test, staging roughly 11,000
light curves took about 8 hours. The staff are still
investigating this limit.
- File Upload Limits:
File uploads can include up to 10,000 entries. For Data IDs (or Kepler IDs),
the results will be written to a single output table, allowing all returned data
to be retrieved simply by marking all rows and clicking the Submit
button. For file uploads containing target names or coordinates, each entry
is treated as a separate search request, and each set of results would
have to be requested for retrieval individually. As mentioned above,
the default is 20 records per target, with an option to increase this to 5,000
records per target.
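Below is a minimal sketch of issuing a search as an HTTP GET request with the
max_records parameter raised above its 2,000-record default. The endpoint URL
and the action, target, and outputformat parameter names are illustrative
assumptions; consult the MAST Services page for the actual parameters
supported by each mission.

    import urllib.parse
    import urllib.request

    BASE = "http://archive.stsci.edu/kepler/data_search/search.php"  # assumed endpoint

    params = {
        "action": "Search",     # assumed: tells the search script to run the query
        "target": "kepler-10",  # assumed target-name parameter
        "max_records": 5000,    # raise the 2,000-record GET default described above
        "outputformat": "CSV",  # assumed: request CSV instead of HTML
    }

    url = BASE + "?" + urllib.parse.urlencode(params)
    # Allow up to the full 5-minute server execution limit before giving up.
    with urllib.request.urlopen(url, timeout=300) as resp:
        with open("search_results.csv", "wb") as out:
            out.write(resp.read())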
Improving Large Search Requests
- Reduce Number of Columns:
The amount of memory used is directly related to the number of
columns requested in the search results. A subset of the
available columns is used by default, but the number can be further reduced
(or increased, for that matter) using the available search form options.
If you want to retrieve data, you need to include at least
the data set name and the mark columns. Note that some links shown in the
search results may not work, depending on which default columns are removed.
For GET requests, the default fields will be returned unless the
selectedColumnsCSV parameter is used to list all the desired fields in a
comma-separated list (see the sketch after this list).
- Reduce Number of Rows:
This may be obvious, but
the more rows found (and returned), the greater the memory usage and execution time.
You might, for example, reduce the search radius for target name or coordinate searches,
add more constraints, or narrow the limits for range searches.
- Use Alternate Output Formats:
Results displayed in HTML require more formatting and browser memory
than other format options.
Switching to CSV, VOTable, or another non-HTML output format can
therefore reduce server execution times, and writing results directly
to a file rather than to the browser reduces client-side memory usage.
Note, however, that the HTML format is needed for data retrieval unless you
can
- use the File: WGET option to download online files (currently
available for Kepler, IUE, and FUSE only), or
- plan to download the search results and retrieve data via a file
upload (see below).
- Use Sort Options:
If you're looking for data with the highest or lowest value
in a particular column,
you may be able to reduce the number of downloaded
entries by sorting the results on that column when you submit the request.
Click "Reverse" next to the sort option to reverse the displayed order of rows.
- Query Indexed Fields:
Some fields are indexed to make queries on those fields run faster.
The columns shown in italics in the Field Description pages are indexed. For
example, from the KIC help page at:
search_fields?mission=kic10
you can see that Kepler_id, ra, and Dec are all indexed fields. It is
possible to index additional fields, so please contact the
help desk if problems occur.
- Use Multiple Search Requests:
The last resort is to break large requests into a series of smaller
ones in order to stay within the limits mentioned above (see the sketch
after this list).
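The sketch below, under the same assumptions as the previous example,
combines several of these suggestions: it trims the returned columns with
selectedColumnsCSV, requests CSV output written straight to a file, and
breaks one large coordinate search into a series of smaller RA slices. The
column names and the RA range syntax are illustrative assumptions; check the
Field Description pages for the real column names.

    import urllib.parse
    import urllib.request

    BASE = "http://archive.stsci.edu/kepler/data_search/search.php"  # assumed endpoint

    def fetch_slice(ra_min, ra_max, outfile):
        """Run one bounded search and append its CSV results to outfile."""
        params = {
            "action": "Search",                            # assumed
            "ra": "{0} .. {1}".format(ra_min, ra_max),     # assumed range syntax
            "selectedColumnsCSV": "ktc_kepler_id,ra,dec",  # only needed columns (names assumed)
            "max_records": 25001,                          # form maximum noted above
            "outputformat": "CSV",                         # assumed
        }
        url = BASE + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url, timeout=300) as resp:
            outfile.write(resp.read())

    with open("all_results.csv", "wb") as out:
        # Split one large search into 2-degree RA slices, each its own request.
        for ra in range(280, 300, 2):
            fetch_slice(ra, ra + 2, out)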
Improving Large Data Requests
- Wait for Search Results: When retrieving data from a search request,
it is important to wait until ALL the results are displayed before submitting
a data request. The search results will begin to be displayed before the script
has completed execution, but the "submit data" option will not work until the
script has completed. A good way to tell whether the search script is done is
either to look for the "Done" message at the bottom of the browser window
(as in Firefox), or to look for the phrase "Page 1 of n" at the beginning of the
search results; the number "n" will be blank until the script has completed.
Another indicator is the striping of alternate rows, which is one of the
last actions performed.
- File Uploads Directly to the Data Request Page:
It is possible to submit file uploads directly to the data retrieval
interface at:
http://archive.stsci.edu/cgi-bin/kepler/dataset_lookup
for Kepler,
http://archive.stsci.edu/cgi-bin/dataset_lookup
for HST, or
http://archive.stsci.edu/cgi-bin/fuse/dataset_lookup
for FUSE.
The uploaded file must be a list of data set names, and the
data retrieval limit mentioned above still applies.
The uploaded file can be created by saving the results of a search request
with an output format such as "File: comma-separated values" (see the sketch
after this list).
- Submit Multiple Requests:
As with search requests, the last resort is to break large data requests into a series of smaller
ones in order to stay within the limits mentioned above.
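The following sketch illustrates preparing such an upload file: it extracts
the data set name column from search results saved in "File: comma-separated
values" format and writes one name per line. The column header used here is
an assumption; use whatever header your saved results actually contain, and
remove any secondary type/units row from the saved file first.

    import csv

    with open("search_results.csv", newline="") as src, \
            open("upload_list.txt", "w") as dst:
        for row in csv.DictReader(src):
            # "Dataset Name" is an assumed column header; adjust to match
            # the header line in your saved search results.
            dst.write(row["Dataset Name"] + "\n")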