From: "Peter F. Couvares"
To: "ggraham"
Cc: "Ruth Pordes"; "Arie Shoshani"; "Richard J. Cavanaugh"
Subject: Re: [Ppdg-steering] MOP review on May 9th
Date: Thursday, May 09, 2002 12:38 PM

Ruth asked:

>> 1) The "production" deployment of MOP seems to have required a large
>> fraction of the combined set of US CMS physics grid resources for the
>> past few months. Was this inevitable? What could have been done
>> differently to reduce this huge load on the many who have contributed so
>> actively over the last few months?

By "resources", I assume you mean people? If so, I think in retrospect
that this was largely inevitable, yes -- for the simple reason that the
grid tools we're relying on, while proven in concept, haven't been used
in many (any?) real production applications before, and have a lot of
rough edges. There's little we could have done this spring to change
that.

We could, however, have recognized it more clearly before starting, and
made sure that we had a core developer on board representing each major
grid component to help us diagnose and fix problems. One reason we've
spent so much time on Globus bugs lately is not that we found fewer bugs
in other components (Condor-G, GDMP, etc.), but that we had core
developers from those other components on hand to find and fix them
immediately, so the turnaround was better.

But the bottom line is that we're early adopters, and we're doing the
grid beta-testing whether we like it or not.

Greg wrote:

> The deployment of MOP takes place in two phases: A development deployment
> intended to be at Wisconsin and a production deployment intended to be at
> FNAL with respect to the "mop-master". Currently, we have but one
> deployment at Wisconsin which is doubling as both development and
> deployment right now. This is leading to some problems in getting
> official results back to CMS production, but it has been fantastic from a
> development point of view.

Right.
But the very fact that we need to do all this development is a sign
that we're not yet prepared for hands-free production. I don't think it
would be a good idea for us to revert to some "proof-of-concept"
activity -- but I think that we need to recognize the difference between
smooth, consistent, hands-free production, and herky-jerky,
heavily-babysat production. We're still doing the latter. :-)

> In my opinion, the load carried by many people in PPDG (and one in
> CMS ;-) was inescapable in that any reduction in load would have directly
> impacted the good results: namely the number of bugs found, reported, and
> fixed in Globus, Condor-G, DAGMAN, GDMP, etc. is linear in the time
> spent by Peter, Shahzad, and Alain. (Alan DeSmet has just started.) The
> amount of support from CMS needed to enable this work (a clean integration
> with IMPALA, lobbying for a production assignment) was directly
> proportional to the time that I put in.

I agree completely.

> In the near future, I hope (and propose in the questionnaire) that we
> move to create a real production MOP master site at FNAL. This assumes
> that the remaining level of bugs makes it feasible. This should
> at least decouple the pressures from CMS Production coordinators
> from the MOP developers. However, I cannot answer the question yet as to
> whether it is ready to deploy on "non-grid" resources. We may talk about
> that on Thursday.

Right. My feeling is that we're not there yet, but we're close. It's
unpredictable, though, because we consistently discover new bugs only
once existing bugs are squashed and we're able to ramp up production a
little more than before. I think a few times this spring we made the
mistake of assuming that once the bugs in front of us were squashed,
we'd be ready to rip.

>> 2) Do you have a concern as to how the US CMS PPDG developments
>> overlap/interact with the overall CMS grid developments and program?
>> Is there anything you wish PPDG was doing more or differently to help
>> avoid duplication of effort between the EU and US sides of the program?

This is an area in which I'm confused, to be honest. I've seen a lot of
PowerPoint presentations, and I get the sense lots of people are making
sure the EU and US grand plans don't duplicate each other's work too
much. But it's not clear to me what actual software really exists today
and might be useful to us, or vice versa.

It might be useful at one of our cross-project meetings to spend more
time just doing some show-and-tell of the current state of our software
(not just slides), with the explicit goal of trying to recognize
opportunities for immediate collaboration or re-use of software.

I wouldn't want to delay our work waiting for another project's
component to be finished, just for the sake of integration. But if
related grid software exists, I think we could probably be doing more to
utilize it, and feed our experiences back and forth. I'm really just
speculating, though, since I don't know what other groups are actually
accomplishing today, and how much time it would take to work together.

> MOP is essentially a method for packaging CMS Production scripts so that
> they can run with Condor-G/DAGMAN and Globus.

Right. Although making MOP useful required a lot of integration work
with IMPALA, MOP itself is really just a relatively small, simple
DAG-generator. It automates the process of generating DAGs to run lots
of grid-unaware jobs in parallel on remote sites, creating and filling
in the implicit jobs needed to stage data back and forth, etc.

I'm honestly not sure if the actual MOP code will fit cleanly within the
GriPhyN virtual-data software, or the EDG software -- both of them will
certainly require similar functionality (i.e., a DAG-generator), but may
end up re-implementing it within another component, like the "planner".

> GDMP is both a replica manager and a file transfer method.
> The MOP run has exposed some limitations of GDMP as it now exists, but
> nonetheless GDMP debugging contributed to the overall Globus debugging
> we have experienced. I am not concerned about overlap with EU grid tools
> since GDMP is the province of WP2, but the continued PPDG support of
> GDMP is greatly desired. A lot of time and energy has been invested by
> CMS to give GDMP a chance to work in a production environment. I am
> concerned that GDMP overlaps with other CMS file transfer tools, and
> that if support for GDMP from PPDG wanes then we will lose this
> important laboratory for doing grid enabled file transfers.

I'll be honest: I'm not sure we want or need GDMP. I think, on the
testbed, we intended to use it to do things it was not exactly designed
to accomplish. But for our purposes, just using a replica catalog and
gridftp might be simpler.

>> 3) Was sufficient and appropriate documentation available for underlying
>> software, interface and protocol specs for the components MOP is
>> integrating? Do you think that this is an area that PPDG should increase
>> its attention and priority to?

Documentation is sparse and much of what we know is oral history. We
could use more. But I think we need to be careful about exactly what we
spend time documenting.

>> 4) How susceptible do you think MOP is going to be to changes in
>> interface in the underlying grid components - for example those included
>> in the Virtual Data Toolkit to date?
>
> Since MOP is very generic and is such a thin interface, I think that it
> is not very susceptible to changes in the underlying interfaces. As long
> as the interfaces of GDMP are well documented, for example, this calls
> for a change in only one DAG node of MOP. If the condor_submit_dag
> interface changes, then the function within the MOP python code that
> handles the submit needs to be changed.
> In my experience adding DAG nodes and cleaning up MOP code while Peter
> was on vacation, the components were fairly well isolated.

I agree. However, one major caveat is that if MOP is expected to support
multiple versions of these tools simultaneously, it may require
extensive code to identify which version we're talking to and use the
appropriate interface. It's simple only when, like now, we require a
single supported version of Globus, GDMP, Condor-G, etc., everywhere. In
"production" this may not be possible or desirable.

Talk to you all in a few hours!

-Peter

--
Peter Couvares                  University of Wisconsin-Madison
Condor Project Research         Department of Computer Sciences
pfc@cs.wisc.edu                 1210 W. Dayton St. Rm #3393
(608) 265-8936                  Madison, WI 53706-1685
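
Illustration of the "small, simple DAG-generator" idea described above. This is a minimal sketch and not MOP's actual code: the function name, the node names (stage_in_*, run_*, stage_out_*), and the convention that each node's submit file is called <node>.sub are all assumptions made for the example. It only shows how implicit stage-in and stage-out jobs might be wrapped around each grid-unaware job, using the JOB and PARENT ... CHILD lines of a DAGMan input file.

```python
def generate_dag(job_ids):
    """Return DAGMan input-file text for the given job ids.

    Each job gets three nodes: stage input data in, run the job,
    stage output data back. All names here are hypothetical.
    """
    lines = []
    for job in job_ids:
        stage_in = f"stage_in_{job}"
        run = f"run_{job}"
        stage_out = f"stage_out_{job}"
        # Declare the nodes; assume one submit file per node.
        for node in (stage_in, run, stage_out):
            lines.append(f"JOB {node} {node}.sub")
        # Order the nodes within this job. Different jobs share no
        # edges, so DAGMan is free to run them in parallel.
        lines.append(f"PARENT {stage_in} CHILD {run}")
        lines.append(f"PARENT {run} CHILD {stage_out}")
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    print(generate_dag(["0001", "0002"]), end="")
```

Because the staging nodes are generated rather than written by hand, a change in how data is moved would touch only the code that emits those nodes, which matches the point above about components being fairly well isolated.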