If you are interested in searching for or retrieving more than
several thousand data sets, here are some suggestions that may help your requests
run more smoothly. In the event you do have problems though, please
contact the STScI help desk at archive@stsci.edu and explain the problem.
We may be able to accommodate special requests.
The information below applies to all MAST missions (accessed via archive.stsci.edu)
and describes requests submitted
from either a MAST search form or an HTTP GET request. See the
MAST Services page for more information on
submitting GET requests. For more information on ways to submit HST requests
and HST-specific restrictions, see the HST
bigsearch_request
web page.
Request Limits
First, it is important to distinguish between search requests and data requests.
"Search Requests" refers to querying a database table and displaying the results.
"Data Requests" refers to submitting requests to have data retrieved from
the archive and copied to a medium or location of your choice. Generally, the
search requests are executed using scripts written in PHP while data retrieval
scripts are written in Perl. Although we
have been increasing our limits (mainly for the Kepler and the Hubble Source Catalog
projects), there
are several different restrictions that affect the allowed (and practical) size
of both of these requests:
Data Search Limits: In theory, users can submit queries
that will find up to 100 million rows; by default, however,
the standard search forms display only the first 5,000 rows.
Using the "max records" search form element, the number
displayed may be increased up to 50,000. For the file upload forms, the
current default is to retrieve 20 records per target but the limit can be
increased up to 5,000 records per target. For
HTTP GET requests, the max_records parameter defaults to 2,000 records, but
may be set to any value below 100 million. The more entries returned, however,
the longer the execution time, the more data to transmit to the user,
and the more memory required by the browser if results are displayed in HTML.
The latter could affect JavaScript features such as sorting and paging results.
Memory Limits: As mentioned above, large requests can hit memory limits
on both the MAST web server and the user's web browser. Server limits depend on the
number of concurrent users and are hard to quantify. Browser limits depend on the amount
of memory allocated for programs, such as those written in JavaScript, that allow sorting
or paging of search results. Safari in particular
is known to reserve less memory for JavaScript than browsers such as Firefox or Chrome,
so users may want to try other browsers if problems occur.
Data Retrieval Limit: Even though all the displayed search result entries can be
submitted for data retrieval, there is a limit on the number of data sets that can be
reasonably submitted in one request. Initial tests with retrieving Kepler light curves
indicate requests for up to 10,000 data sets are possible. In one test,
staging roughly 11,000 light curves took about 8 hours. The staff, however, are
still investigating this limit.
File Upload Limits:
File uploads can include up to 10,000 entries. For Data IDs (or Kepler IDs),
the results will be written to a single output table allowing all returned data
to be retrieved simply by marking all rows and clicking the Submit
button. For file uploads containing Target names or Coordinates, each entry
is treated as a separate search request and each set of results would
have to be requested for retrieval individually. As mentioned above,
the default is 20 records per target with an option to increase this to 5,000
records per target.
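Because a file upload can include at most 10,000 entries, a longer list of IDs has to be divided across several upload files. The following is a minimal sketch of that bookkeeping in Python. The function name, the file naming scheme, and the one-ID-per-line layout are assumptions made for illustration and are not a documented MAST format.

```python
from pathlib import Path

MAX_UPLOAD_ENTRIES = 10_000  # file upload limit quoted above


def write_upload_files(ids, out_dir, prefix="ids_part", limit=MAX_UPLOAD_ENTRIES):
    """Write `ids` to numbered text files holding at most `limit` entries each.

    The one-ID-per-line layout and the file names are assumptions.
    Returns the list of paths written.
    """
    ids = list(ids)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for part, start in enumerate(range(0, len(ids), limit), start=1):
        path = out_dir / f"{prefix}_{part:03d}.txt"
        chunk = ids[start:start + limit]
        path.write_text("\n".join(str(i) for i in chunk) + "\n")
        paths.append(path)
    return paths
```

Each file written this way could then be submitted as its own upload, so that no single upload exceeds the limit.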
Improving Large Search Requests
Reduce Number of Columns:
The amount of memory used is directly related to the number of
columns requested in the search results. A subset of the
available columns is used by default, but the number can be further reduced
(or increased for that matter) using the available search form options.
If you want to retrieve data, you need to include at least
the data set name and the mark columns. Some links shown in the search results may
not work, though, depending on which default columns are removed.
For GET requests, the default fields will be returned unless the
selectedColumnsCSV parameter is used and all the desired fields are listed in a
comma-separated list.
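As a rough illustration, a GET request that limits both the rows and the columns returned could be assembled as sketched below. Only the max_records and selectedColumnsCSV parameter names come from this page. The base URL is an assumption, and the field names are taken from the KIC field list as an example, so check the MAST Services page for the values that apply to your mission.

```python
from urllib.parse import urlencode

# Assumed endpoint, for illustration only; see the MAST Services page
# for the actual URL of the mission you are querying.
BASE_URL = "https://archive.stsci.edu/kepler/kic10/search.php"


def build_search_url(columns, max_records=2000, base_url=BASE_URL, **constraints):
    """Return a GET URL that limits both the rows and the columns returned."""
    params = dict(constraints)
    params["max_records"] = max_records
    # All desired fields go into one comma-separated list.
    params["selectedColumnsCSV"] = ",".join(columns)
    return f"{base_url}?{urlencode(params)}"


# Example field names only; use the names listed for your mission.
url = build_search_url(["Kepler_id", "ra", "Dec"], max_records=500)
print(url)
```

Note that urlencode percent-encodes the commas in the column list (as %2C), which is the standard encoding for a query string.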
Remove Mark Button:
The mark button is needed for retrieving data and is normally added
in mission searches, but this can add a significant amount of execution time.
For example, a Kepler request to find all false positives (currently 36,217
targets) took about 90 seconds with the Mark button, and 23 seconds without it.
Skipping the formatting (see below) further reduced this time to
about 12 seconds. So, if you don't need to retrieve data, you may want to
remove it from the output columns.
Note the Mark button is only used for MAST mission searches
when results are displayed in HTML.
Reduce Number of Rows:
This may be obvious, but
the more rows found (and returned), the more memory and execution time are required.
You may, for example, try reducing the search radius for target name or coordinate searches,
adding more constraints, or tightening the limits for range searches.
Skip Formatting:
An option now exists to skip the formatting of the search results
after the query is completed. In some cases, skipping this step can
reduce execution times by a factor of 2. (Note this step applies to
ALL output formats.)
Remove Null Columns:
A new option exists to allow users to remove columns from the search results
which contain all null entries. This can be useful when querying sparsely
populated tables like the Hubble Source Catalog. Although some time is
needed to calculate the null columns, if a large number are removed,
the request may even run faster. Note columns containing zeroes are
not removed.
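To make the rule concrete, the sketch below applies the same idea on the client side to results already downloaded as CSV. It is not the code MAST runs on the server. It assumes that a null entry appears as an empty cell, and the function name is hypothetical.

```python
import csv
import io


def drop_null_columns(csv_text, null_values=("",)):
    """Remove columns whose every data cell is null; zeroes count as data."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if len(rows) < 2:
        return rows  # nothing to judge without at least one data row
    header, data = rows[0], rows[1:]
    keep = [
        i for i in range(len(header))
        if any(i < len(row) and row[i].strip() not in null_values for row in data)
    ]
    return [[row[i] if i < len(row) else "" for i in keep] for row in rows]


sample = "id,flux,note\n1,0,\n2,,\n"
print(drop_null_columns(sample))
```

In this sample the "note" column is dropped because every cell is empty, while "flux" is kept because one of its cells contains a zero.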
Use Alternate Output Formats:
Results displayed in HTML require more formatting and browser memory
than other format options.
Switching to CSV, VOTable, or other non-HTML output formats could
significantly reduce execution times, and writing results directly
to a file rather than the browser would reduce client-side memory usage.
In one example, a 47 second HTML request ran in 4 seconds using CSV output.
Note, however, that HTML format may be needed for data retrieval unless you
either:
use the File: WGET option to download online files (currently
available for Kepler, IUE, or FUSE only), or
plan to download the search results and retrieve data via a file
upload (see below).
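For scripted requests, one way to write results directly to a file rather than to a browser is sketched below in Python. It assumes the request URL already selects a non-HTML output format. The function name, chunk size, and timeout are illustrative choices and are not MAST requirements.

```python
import shutil
from urllib.request import urlopen


def save_results(url, dest_path, chunk_size=1024 * 1024, timeout=600):
    """Stream the response body for `url` straight into `dest_path`.

    Copying in fixed-size chunks keeps client memory use roughly constant,
    however many rows the request returns.
    """
    with urlopen(url, timeout=timeout) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return dest_path
```

A call such as save_results(url, "results.csv") would then leave the complete table on disk, where it can be read a row at a time instead of being loaded all at once.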
Use Sort Options:
If you're looking for data with the highest or lowest values
for a particular column, you may be able to reduce the number of downloaded
entries by sorting the results on that column when you submit the request.
Click "Reverse" next to the sort option to reverse the displayed order of
rows (i.e., from highest to lowest).
Query Indexed Fields:
Some fields are indexed to make queries on those fields run faster.
The columns written in italics in the Field Description pages are indexed. For
example, from the KIC help page at:
search_fields?mission=kic10
you can see that Kepler_id, ra and Dec are all indexed fields. It's
possible to index additional fields, so it's important to contact the
help desk when problems occur.
Use Multiple Search Requests:
The last resort is to break large requests into a series of smaller
ones in order to stay within the limits mentioned above.
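For a range search, one simple way to do this is to divide the range into equal slices and submit one request per slice. The sketch below only computes the slices. The function name is hypothetical, and how each slice is entered into a search form or GET request depends on the mission and is not shown.

```python
def split_range(low, high, pieces):
    """Divide [low, high] into `pieces` contiguous sub-ranges of equal width."""
    if pieces < 1:
        raise ValueError("pieces must be at least 1")
    if high <= low:
        raise ValueError("high must be greater than low")
    width = (high - low) / pieces
    # Use `high` itself as the final edge so rounding cannot shrink the range.
    edges = [low + i * width for i in range(pieces)] + [high]
    return list(zip(edges[:-1], edges[1:]))


# Example: a 20 degree span in right ascension split into four requests.
for start, stop in split_range(280.0, 300.0, 4):
    print(f"{start:.2f} to {stop:.2f}")
```

Adjacent slices share an endpoint, so a row lying exactly on a boundary may be returned by two requests. Removing duplicates by data set name after combining the results avoids counting it twice.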
Improving Large Data Requests
Wait for Search Results: When retrieving data from a search request,
it's important to wait until ALL the results are displayed before submitting
a data request. The search results will begin to be displayed before the script
has completed execution, but the "submit data" option will not work until the
script has completed. A good way to tell if the search script is done is to
either look for the "Done" message at the bottom of the browser window
(as in Firefox), or look for the phrase "Page 1 of n" at the beginning of the
search results. The number "n" will be blank until the script is completed.
Another indicator is the striping of alternate rows which is one of the
last actions performed.
Submit Multiple Requests:
As with data search requests, the last resort is to break large requests into a series of smaller
ones in order to stay within the limits mentioned above.