Numeric requirements for the replica catalog service

Mike Wilde, Koen Holtman, January 2002

*** DRAFT V2 ***

Jan 9 2002 Draft released to support the PPDG file replication
meeting of Jan 10.

*** Introduction

This document gives numeric requirements for the replica catalog
service that is an element of Grid architectures of the GriPhyN,
PPDG, and EU DataGrid projects.

These requirements were gathered and summarized from input by the
following experiments: ATLAS (from an EU DataGrid WP8 document),
BaBar (Koen Holtman estimates verified by Andrew Hanushevsky), CMS
(Communication from Koen Holtman, Heinz Stockinger), D0 (from Ruth
Pordes), SDSS (James Annis), STAR (from Matthias Messer via Doug
Olson).

*** Architectural and usage assumptions

Architectural requirements specifications of the replica catalog
service are found in several documents:

- the EU DataGrid and WP2 architecture document
- the GriPhyN/PPDG Data Grid Reference Architecture
- the CMS Data Grid system overview and requirements document

We believe that all of these requirements statements are mutually
compatible, even though there are differences in terminology, and
even though some require more features of their 'catalog' than
others.

The numeric requirements in this document are written under the
following assumptions.

- each virtual organization operates a single replica catalog service
  inside its data grid system
- the service is not necessarily implemented with a single server
  machine, in fact in the longer term it is more likely that several
  machines will be used as servers and/or caches.
- the service maintains a mapping from logical file names to physical
  file names
- one logical file name maps to a (possibly empty) set of physical
  filenames
- from everywhere in the virtual organization's grid, an application
  executable can use the replica catalog service by calling functions
  provided by a library linked in with the executable.
- the mapping from logical to physical file names may be maintained
  with some type of relaxed consistency, for example updates may not
  become visible everywhere immediately.
- the replica catalog service might also provide a mapping from
  logical file names to pieces of metadata, however for this second
  mapping this document does not specify requirements -- in fact we do
  not know of any numeric requirements existing yet in this area
- the service does not necessarily implement functions for reliable
  file replication and consistency maintenance: some or all of these
  are more likely implemented by higher-level components.
- some experiments prefer or require that the logical file names are
  free-form strings, i.e. strings whose contents are not constrained
  by the replica catalog service. Some experiments prefer or require
  this also of the physical file names; in particular some do not
  assume that the physical file name always contains some encoding of
  the full logical file name.

*** Capacity

Number of logical files (by end of each year):

  2001: 1M
  2002: 2M
  2003: 10M
  2004: 10M
  2006: 40M

Maximum number of physical files per logical file:

  2001: 10
  2006: 500
  (average is expected to be less than 10)

Number of users (processes simultaneously running that use the
service):

  2001: 1000
  2006: 10000

(interpretation of the above numbers: if the service were implemented
so that every user would have a continuous tcp/ip connection to it,
there would be 1000 connections into the service in 2001.)

*** Multi-file operations

Catalog users might be able to specify some catalog interactions as
multi-file operations, if a multi-file operation API is provided. But
according to our understanding of the experiments, such aggregation
cannot be assumed to occur so frequently that its optimizing effects
will `save' a catalog system that is too slow to support equivalent
workloads with all operations being single-file operations.
The rates given below are therefore all for single-file operations.

*** Access rates

Number of update operations completed per minute. An update can be
the insertion of a new logical file name with an initial physical
file, adding a new physical file to the mapping of an existing
logical file, deleting a physical file from a mapping, etc.

  2001: 60/minute
  2006: 600/minute

In these counts all update operations are assumed to be operations
with a scope of a single logical file -- i.e. this is not a count of
60 `bulk update' operations per minute, each of which inserts, for
example, 10K logical files.

Number of lookup operations completed per minute. These numbers are
taken to be `simple' lookup operations that look up information
associated with a single named logical file.

  2001: 60,000/minute
  2006: 600,000/minute

Note: 60,000 operations per minute does not mean that a single
operation must be completed in 1/60,000 minute = 1/1000th second, as
operations may be processed concurrently.

FOR COMPARISON:

- www.tpc.org has SQL database transaction processing benchmark
  results. Highest result on benchmark TPC-C is for a configuration
  with 36 server machines: 709,000/minute. For the Dell PowerEdge
  4400 (current frontend machine of Caltech tier2 farm, 2-CPU
  machine): 16,263/minute.
- www.tpc.org has TPC-W WIPS (web interactions per second) results.
  These counts are in completed HTTP requests per second for a web
  shopping workload. Highest result: 7,073/second (424,380/minute)
  for a service configuration with 23 server machines.
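- To put the required lookup rates on the same per-second footing as
  the TPC-W figure, and to illustrate the note on concurrency, the
  arithmetic can be sketched as below. This is an illustration only:
  the 50 ms per-lookup latency is an assumed value, not a requirement
  stated by any experiment, and the function names are hypothetical.
  The concurrency estimate uses Little's law (operations in flight =
  throughput x latency). The benchmark workloads differ from catalog
  lookups, so the ratio is only a rough indication of scale.

```python
# Illustrative arithmetic only; function names and the latency value are
# assumptions made for this sketch, not part of the requirements.

LOOKUPS_PER_MINUTE = {2001: 60_000, 2006: 600_000}  # from "Access rates"
TPC_W_TOP_WIPS = 7_073  # TPC-W figure quoted above, already per second

ASSUMED_LATENCY_SECONDS = 0.05  # hypothetical 50 ms per lookup


def per_second(rate_per_minute):
    """Convert a per-minute rate to a per-second rate."""
    return rate_per_minute / 60.0


def lookups_in_flight(rate_per_minute, latency_seconds):
    """Little's law: average concurrent operations = throughput * latency."""
    return per_second(rate_per_minute) * latency_seconds


def ratio_to_tpc_w(rate_per_minute):
    """Required lookups per second relative to the quoted TPC-W result."""
    return per_second(rate_per_minute) / TPC_W_TOP_WIPS


for year, rate in sorted(LOOKUPS_PER_MINUTE.items()):
    print(
        year,
        per_second(rate), "lookups/s,",
        lookups_in_flight(rate, ASSUMED_LATENCY_SECONDS), "in flight,",
        round(ratio_to_tpc_w(rate), 2), "x TPC-W top result",
    )
```

  Under the assumed latency, the number of lookups in progress at any
  moment stays well below the number of simultaneous users listed
  under Capacity; a different latency scales the estimate linearly.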
END OF COMPARISON

*** Relaxed consistency metrics

Delay time allowed (in seconds) for an update to become visible to
lookup operations performed anywhere in the grid

  2001: 30
  2006: 30

Delay time allowed (in seconds) for an update to become visible to
lookup operations performed on the same grid site that invoked the
update operation

  2001: 5
  2006: 5

*** Service crash recovery requirements

We know of no numeric requirements for service crash recovery. One
general assumption is that server daemons involved in providing the
service should be able to restart automatically, with no manual
intervention needed, if the system they run on crashes and is
rebooted.

If the contents of the catalog become corrupted or lost through
hardware or software failure, there are different strategies for
restoring them. Some experiments expect to be able to re-create the
catalog contents from data kept in other repositories. Other
experiments expect that the catalog service will come with the
ability to make backups of the catalog contents and to restore them
later.

*** Availability requirements in the case of network or hardware failure

There is no hard numeric requirement we know of for, say, 99.N%
availability, or continued service under fewer than X simultaneous
component failures, etc. Rather, the general requirement is that the
service makes a good effort at being tolerant of network and hardware
failures. How much of an effort is required is very much a function
of what the failure characteristics of real-life systems and networks
are. This is an area that needs more study. (e.g. preliminary
results based on pinger data indicate that network connectivity
between CERN and the average site in western Europe and the US is
only down for about 0.5-2% of the time.)

*** Access patterns

...to write..

How is the service workload divided over calling users and over the
logical files in the catalog?
*** Globus replica catalog 'Collections' and other partitionings of the logical file name space

The Globus replica catalog documentation defines a particular concept
of 'Collections'. A collection is a group of logical files; each
logical file is always in exactly one collection. The collection
that the logical file is in is encoded as a substring of the logical
file name.

We see collections as the externally visible effects of a particular
implementation optimization choice. Support for collections is not a
user requirement for the catalog service -- in fact some experiments
might find the implied constraints on logical file naming burdensome.

If the above numeric performance can be reached without exposing a
collection concept to the users of the catalogs, that would be
preferred by several experiments. On the other hand, if the catalog
implementation requires that the experiment define collections at
some granularity and reflect these in the logical file names, this
requirement will not be exceedingly burdensome to the experiment.

Other partitionings of the logical file name space can be considered,
for example a partitioning based on a hash function, internal to the
catalog service implementation, that is invisible to the users of the
catalog service. Another is a custom partitioning function defined
by the catalog administrator, encoding knowledge about the logical
file name space of the virtual organization, in such a way as to
optimize the internal workings of the catalog service.

*** OTHER OPTIMIZATIONS THAT NEED TO BE DISCUSSED???

*** Appendix: Per-experiment detailed numeric requirements

.. attach Mike's spreadsheet with more annotations in it where the
data comes from, and ATLAS modified with a single 2002 line according
to the numbers below.
Current version of this appendix is at or near the top of the page:
http://www-unix.mcs.anl.gov/~wilde/replication/

*** TEMPORARY APPENDIX: ATLAS requirements

--> this needs to be put in the spreadsheet.
--> the numbers above in this document already take these atlas
    numbers into account

The text below is from the ATLAS section of the EU DG WP8 D8.1b
document at
http://datagrid-wp8.web.cern.ch/DataGrid-WP8/Documents/Workspace/D8.1/D8.1.b_v2.0.pdf

As far as I (Koen) can determine these are requirements for the
service to be delivered by the EU DG at the end of the project (end
2002).

7.4.1.2 Scalability Requirements

o The Replica Service Catalog must be able to gracefully manage on
  order:
    o 10 PetaBytes of data
    o 10,000,000 Logical Files
    o 1000 Storage Elements
    o 10 PFN per LFN
    o 100 Compute Elements
o The performance of the Replica Service should be such that the
  penalty for individual atomic Replica Service operations is <25% of
  the time for a file open operation.
o The API for the Replica Catalog should not prohibit use on
  non-network connected compute elements (such as personal computers
  or laptops).

(end of text from EU DG WP8 D8.1b)

*** Appendix: CMS numeric details and methodology explanations

Contents of this appendix:
http://www-unix.mcs.anl.gov/~wilde/replication/Koen.LHC.rcat.Reqs.2001.0322.txt