PPDG Site AAA Issues List

The PPDG Site AAA project is creating two working documents in parallel. The first is an issues list that captures concerns about the function and operation of the Grid tools currently under PPDG review. Some of these issues will have several possible resolutions with differing requirements; others will share common requirements and/or resolutions.

The second document in the set is a requirements list that all acceptable software must meet. It is expected that not all issues will result in requirements; however, all requirements should have a corresponding issues discussion.

  1. Authentication

    Authentication typically involves three different types of entities, each with different assumptions and natural methods: Users, Hosts and Services. The GSI/PKI authentication methods for Hosts and Services are quite similar in principle (if not in revocation usage) to those we use now. For this discussion, I'll concentrate on User Authentication.

    1. Interactive User Authentication:

      Under the current Globus Toolkit infrastructure, a user-authenticated request is commonly generated in one of two ways:

      1. The most common situation is that a user maintains a private key in an encrypted form on their local machine. When they want to compute "on the Grid", they decrypt this key (by providing a passphrase) and use it to generate a temporary X.509 proxy credential, which is then used to authenticate subsequent requests to remote resources.

      2. An alternative strategy, used at some sites, is for the user to authenticate via Kerberos locally and then contact an on-line Kerberos Certificate Authority (KCA; supported, for example, by KX509) to obtain the X.509 proxy credential.

      Increasingly in the news are "smartcard" solutions, which effectively act as individual proxy generators. Since the smartcards do not export the long-term private key and are much more resistant to attack than desktop systems, they share many of the same features as the on-line proxy generator service. Because they are individual and highly portable, they may be vulnerable to different attacks (e.g. theft), but for purposes of this discussion, I consider them the same as method #2.

      NB: I consider proxy repositories (a la MyProxy) to be distinct from proxy generation services (a la KX509). In the former, the user may still retain the ability to make a proxy from (another copy of) the long-term credential. In the latter, the user does not have that ability.

      In approach 1.1.1 above, two forms of credential are at risk:

      a) The user private key exists in a user-only-readable file, in encrypted form. There are presumably several possible risks to be concerned about:

      1. the user might make that file world readable (and surely some users will do so), in an environment in which other users have access to the relevant file system

      2. storage is such that access to the private key file is vulnerable to capture (e.g. network sniff of file system transfer, etc.)

      3. the user might choose a pass phrase that is easily "broken" by someone who gains access to the file system

      b) The proxy credential private key exists in a user-only-readable file, in *unencrypted* form. This key is vulnerable to the same exposure risks noted above. However, the value of this key is time-limited and that lifetime cannot be altered by the possessor of the key. Therefore the vulnerability introduced here is similar to that of many other successfully deployed systems (AFS, Kerberos, etc.).
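
      The first exposure risk above (a key file readable by other users) can be checked mechanically. The following is a minimal sketch, assuming POSIX permission bits; the function name key_file_is_private is hypothetical and is not part of the Globus Toolkit.

```python
import os
import stat
import tempfile

def key_file_is_private(path: str) -> bool:
    """True if no group or other permission bits are set on the file."""
    mode = os.stat(path).st_mode
    # Any group/other bit means users besides the owner can reach the key.
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0

if __name__ == "__main__":
    # Demonstrate on a throwaway file rather than a real credential.
    fd, demo = tempfile.mkstemp()
    os.close(fd)
    os.chmod(demo, 0o600)               # owner-only
    print(key_file_is_private(demo))
    os.chmod(demo, 0o644)               # world-readable
    print(key_file_is_private(demo))
    os.remove(demo)
```

      A check of this kind addresses only accidental exposure through file modes; it says nothing about capture in transit or about passphrase strength.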

      In approach 1.1.2 above, only the second credential (the proxy) is at risk from user/system misconfiguration. The first credential is at risk of theft or misuse primarily through the following:

      1. a user who exposes the access secret needed to generate the proxy (e.g. password written on desktop, etc.)

      2. mis-configured or vulnerable proxy generation servers.

      There is considerable discussion on what protection measures are necessary for the short-lived proxy credential. It's clear that the lifetime of the proxy is a critical parameter in those discussions.

    2. Unattended User Authentication:

      There is a hybrid case (unattended user jobs) that has characteristics of both a user and a service. The most frequent manifestations of this use case are batch jobs and cron jobs. (One can think of cron as a very simplistic batch system.) In such a case, some service (the batch system, cron, etc.) receives a request to perform some task on behalf of the user. Two authentications are needed, which may in principle be widely separated in time: first, the authentication of the request for a command to be run; second, the authentications needed by the command at the time of execution. The first can, in almost all cases, use the normal user authentication methods described above and is pseudo-realtime. The second typically wants to grant the user's authorizations (or a dynamic, usually difficult to define, subset) to the command. There are three general approaches:

      1. The command authenticates as the user and requests made are indistinguishable from those made interactively by the user.

      2. The command authenticates as an identity algorithmically derived from (but separate from) the user. Commands are issued as that derived identity.

      3. The command authenticates as the "batch" service.

      The operator of the batch service will necessarily have the ability to authenticate by whichever method is selected, for the lifetime of the authentication "secret". Depending on the skill of its stewardship, attackers of the batch system may gain that ability as well.

      Approach 1.2.1 exposes the user's (sole ?) identity to the full risks of unattended operation and allows for no distinguishing action in the event of compromise/failure. FNAL considers this unacceptable. (I understand this is the current Condor approach ?)

      Approach 1.2.2 requires the maintenance of multiple identities per person (though they may be automatically associated with the primary identity) and the specific inclusion of those separate identities in the resource access control lists. (This is the approach taken by FNAL.)

      Approach 1.2.3 presumes the resources to be accessed by the job are either a) fully available to (any user of) the batch service or b) managed by a service that can carry out a trusted authorization based on the user identity as authenticated by the batch service. Option a) is unacceptable in most operations. Option b) is the CAS approach as I understand it.
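
      Approach 1.2.2 can be illustrated with a small sketch. The naming scheme (user/service/host) and both function names are hypothetical, chosen only to show that a derived identity must be listed explicitly in an access control list and can be revoked without touching the user's primary identity.

```python
def derived_identity(user: str, service: str, host: str) -> str:
    """Hypothetical naming scheme for an unattended identity tied to a user."""
    return f"{user}/{service}/{host}"

def is_authorized(identity: str, acl: set, revoked: set) -> bool:
    # The derived identity must appear in the ACL in its own right;
    # listing the primary identity does not imply it.
    return identity in acl and identity not in revoked

cron_id = derived_identity("alice", "cron", "node1")
acl = {"alice", cron_id}
# After a compromise of the unattended secret, only the derived identity is revoked.
revoked = {cron_id}
print(is_authorized("alice", acl, revoked), is_authorized(cron_id, acl, revoked))
```

      This is the "distinguishing action" that approach 1.2.1 cannot offer, at the cost of maintaining the extra ACL entries.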

    3. Revocation of Authentication:

      All authentication relies on some secret (password, private key file, hardware token, etc.) that can be compromised and used by unauthorized persons. In this regard, user, host, and service authentication share a common concern.

      In the event of a compromise, that authentication ability must be revocable on a timescale appropriate to the compromise. For example, the timescale on which a stolen private key file must be revoked is a function of the strength of its passphrase encryption. If there is none (or the passphrase is also stolen), then revocation is needed immediately. If it is a still-secure, 20-character, random passphrase, the timescale is years (eons ?). A standard operational assumption is that revocation needs to happen within ~24 hours. Authorization restriction methods are presumed to handle reaction times shorter than that.

      Every authentication process needs to invoke tests to determine if the authentication "secret" is a) correct and b) has not been revoked.

      In the case of compromise, one does not usually want to invalidate the identity, but rather the authentication secret (replacing it and invalidating the earlier one). In most authentication systems deployed today, the system queried to determine correctness of the authentication secret is the same one that determines its validity. Thus updating the authentication system is a simple matter of updating the copy (or hash) of the secret held by the authentication server (i.e. setting a new password in the local password file, the NIS password file, or the KDC (AFS, W2K, KRB5)). With PKI those two functions are split: one can test the correctness of the authentication secret (possession of the private key) directly, but there is no way to determine by inspection whether an authentication secret that was once valid is still valid. To do so one must consult an independent authority. Furthermore, once the private key corresponding to a certificate has been compromised, there is no way to "fix" the certificate with a new private key. It must be abandoned, permanently and universally.

      In normal Globus usage, the identity is the DN of a certificate generated by a trusted CA. There may be multiple certificates (valid, expired, and revoked) issued for any one individual. To determine which certificate/private key pairs are valid one has to consult the certificate issuer. Since a validity decision is time dependent, this check must be done for every authentication.

      (Alternately, one could refuse to revoke authentication secrets and push this responsibility onto authorization. Regardless, somewhere there has to be a reliable assertion that the authentication secret is not known to be compromised. To not have that as part of the authentication system seems incorrect.)

      The current Globus method of determining if a certificate is still valid is to presume success and examine a Certificate Revocation List (CRL), if available. (The presumption of success and the fail-open decision mean the system is vulnerable to an attacker who can block access to the CRL.) It requires each relying party to have access to a CRL (or an on-line lookup) for each CA in every certificate chain presented. The maximum allowed age of the CRL is the maximum tolerated latency for revoking certificates (i.e. to have a 1 day response, one must get new CRLs every day or use on-line lookups).
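
      The difference between fail-open and fail-closed behaviour, and the role of the maximum CRL age, can be sketched as follows. The Crl type and the function name are hypothetical toy constructs, not Globus or OpenSSL APIs; a real CRL is a signed structure that must itself be verified.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass(frozen=True)
class Crl:
    """Toy CRL: when it was issued and the serial numbers it revokes."""
    issued_at: datetime
    revoked_serials: frozenset

def certificate_not_revoked(serial: int, crl: Optional[Crl], now: datetime,
                            max_age: timedelta = timedelta(days=1),
                            fail_open: bool = False) -> bool:
    # A missing CRL, or one older than the tolerated revocation latency,
    # gives no usable answer: the policy flag decides the outcome.
    if crl is None or now - crl.issued_at > max_age:
        return fail_open
    return serial not in crl.revoked_serials
```

      With fail_open=True the sketch reproduces the behaviour described above: an attacker who can block CRL retrieval turns every revoked certificate back into an accepted one. With the default, blocking the CRL denies service instead.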

      The CRL is unique to each signing party. Thus in a chain of certificates, one must check not only each signature but also that the signing CA's CRL does not list the signed certificate as invalid. An example is probably in order (since my brain hurts at that text ;-). If CA A generates a certificate for CA B, who generates a certificate for user C, then to determine if the authentication for user C is valid, one must:

      1. Check that the proof of possession of private key for C (using the certificate for user C) succeeds.

      2. Check that CA B's signature of C's certificate is valid (using the certificate for CA B (generated by CA A)).

      3. Check that C's certificate is still valid (i.e. not expired, not on the CRL for CA B, etc.).

      4. Check that CA B is allowed to generate C's certificate.

      5. Check that CA A's signature of B's certificate is valid (using the certificate for CA A stored on the system).

      6. Check that CA A's certificate is still valid (i.e. not expired).

      7. Check that CA A is allowed to generate B's certificate.
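
      The steps above can be sketched as a loop over the chain. This is a toy model under stated assumptions: the Cert type, the signature_valid flag (standing in for a real cryptographic check), the prefix-based signing policy and all names are hypothetical. The sketch applies the expiry and CRL checks to every certificate in the chain, including CA B's own certificate (checked against CA A's CRL), as the paragraph before the list requires.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str
    serial: int
    not_after: datetime
    signature_valid: bool = True  # stand-in for a real cryptographic check

def validate_chain(chain, roots, crls, may_sign, possession_proved, now):
    """Validate chain[0] (user C) up to a locally stored root (CA A).

    roots:    issuer name -> root Cert stored on the system
    crls:     issuer name -> set of revoked serial numbers
    may_sign: issuer name -> subject prefix that issuer is allowed to sign
    Returns (ok, reason). A missing CRL or policy entry fails closed.
    """
    if not chain:
        return False, "empty chain"
    if not possession_proved:                                      # step 1
        return False, "no proof of possession"
    for i, cert in enumerate(chain):
        issuer = chain[i + 1] if i + 1 < len(chain) else roots.get(cert.issuer)
        if issuer is None or issuer.subject != cert.issuer:
            return False, f"no trusted issuer for {cert.subject}"
        if not cert.signature_valid:                               # steps 2 and 5
            return False, f"bad signature on {cert.subject}"
        if now > cert.not_after or now > issuer.not_after:         # steps 3 and 6
            return False, f"expired certificate in path of {cert.subject}"
        revoked = crls.get(cert.issuer)
        if revoked is None or cert.serial in revoked:              # step 3 (CRL)
            return False, f"{cert.subject} revoked or CRL unavailable"
        prefix = may_sign.get(cert.issuer)
        if prefix is None or not cert.subject.startswith(prefix):  # steps 4 and 7
            return False, f"{cert.issuer} may not sign {cert.subject}"
    return True, "ok"
```

      Here the chain for user C would be passed as [C's certificate, B's certificate], with CA A's certificate supplied through roots. Every relying party needs a current CRL from both CA A and CA B for this to succeed.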

  2. Authorization

    1. Who are the necessary authorizing parties ?

      This may, in fact, be a complex question and require a general syntax for expressing requirements. However, we seem to keep coming back to a three tier system: resource manager, resource owner (site), VO. If each of these allows arbitrary complexity, then it seems reasonable to me that we could cover the required space with these three entities. However, since it's more than 2, dealing with the case of arbitrary authorizing parties may be an easy extension of this minimum.

      A possible solution would be to use a PAM-like framework for authorization decisions at the Resource level. In this model, the Resource Managers would be responsible for structuring the authorization logic appropriately for their resource. This would involve a negotiation with the parties to which they provide service and result in a decision tree using decision modules provided by the authorizing parties. Thus for a general Compute Element at Fermilab, I would envision a decision tree something like this:

      1. Check FNAL authz

      2. Loop over VO membership attributes until pass

        • If CDF_member=true

          • check CDF authz

          • check Resource CDF authz

        • If D0_member=true

          • check D0 authz

          • check Resource D0 authz

        • If CMS_member=true

          • check CMS authz

          • check Resource CMS authz

        • else

          • check Default authz

          • check Resource Default authz

        end loop

      3. Fail if no authz check passes.

      This would imply that the membership information is available with the request (e.g. in an attribute in the proxy certificate). Alternately, one could have the membership checks done in real time (at the expense of another critical path service).
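
      The decision tree can be sketched as a stack of pluggable checks. This is one possible reading, under stated assumptions: membership attributes arrive with the request as keys such as CDF_member, each decision module is a callable returning pass or fail, and the default stack applies only when no membership attribute is set. The names Request, Check and authorize are hypothetical.

```python
from typing import Callable, Dict, List, Mapping

Request = Mapping[str, object]       # attributes carried with the request
Check = Callable[[Request], bool]    # decision module from an authorizing party

def authorize(request: Request, site_check: Check,
              vo_stacks: Dict[str, List[Check]],
              default_stack: List[Check]) -> bool:
    # 1. The site (e.g. FNAL) check must pass before anything else is tried.
    if not site_check(request):
        return False
    memberships = [vo for vo in vo_stacks
                   if request.get(f"{vo}_member") is True]
    # The "else" branch: no VO membership asserted, so use the default stack.
    if not memberships:
        return all(check(request) for check in default_stack)
    # 2. Loop over VO memberships until one VO's full stack passes.
    # 3. Fail if none does.
    return any(all(check(request) for check in vo_stacks[vo])
               for vo in memberships)
```

      Each VO stack would hold that VO's own module followed by the resource's module for that VO, so the resource manager controls the structure while each authorizing party controls its own decision.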

    2. What data may authorization decisions require ?

      In principle, authorization decisions can be based on any arbitrary data the authorizing party chooses. The presumption that the information presented in the SSL (or GSI) connection authentication is sufficient is, in general, false. Already within the sample of the 5 labs participating in this study, we are seeing instances where the GSI information is insufficient to meet site requirements except in very restrictive configurations. Furthermore, applying different authorization requirements based on the request being made is not accommodated. To allow Grid Resources to have autonomy in their authorization, this interface has to be generalized. There are at least 3 ways this could be done:

      1. Have the resource advertise the information needed for authorization and have the requestor present this information with the GSI credential.

      2. Have the resource negotiate authorization methods (and information) with the requestor (ala SASL).

      3. Have the resource fork a separate authorization process to obtain the needed authorization information from the user.

      The first option means that a standardized method of encoding authorization information in GSI proxies has to be developed. This is the method being pursued by the EDG and CAS projects. It further assumes that all authorization tokens can be presented securely by the client without an interactive response from the resource. This makes challenge/response methods difficult to deal with (but perhaps the presumption that a Grid Resource cannot have interactive response back to the submitting user is a reasonable one). It requires the requestor to appropriately construct the proxy based on the service being requested.

      The second option is more forgiving of requestor preparation, but requires a more complex protocol than the current GSI. A framework like SASL that allows for client software to be enhanced with new methods by addition of a (system) library and for servers to present a list of acceptable methods would be appropriate here and a useful way to avoid frequent redistributions of clients to permit new methods (or fix old ones).

      The third seems like a non-starter since it would have to be able to determine relevant proxies for the requestor.
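
      The core of the second option, a server presenting a list of acceptable methods and the client choosing one it supports, can be sketched as below. The function and the mechanism names in the example are hypothetical; this shows only the selection step, not the SASL protocol itself.

```python
from typing import Optional, Sequence

def negotiate_mechanism(server_offers: Sequence[str],
                        client_supports: Sequence[str]) -> Optional[str]:
    """Pick the first mechanism, in the server's preference order, that the client has."""
    supported = set(client_supports)
    for mechanism in server_offers:
        if mechanism in supported:
            return mechanism
    return None  # no common mechanism: the request cannot be authorized

# Hypothetical mechanism names, for illustration only.
print(negotiate_mechanism(["vo-attribute", "site-token", "gsi-dn"],
                          ["gsi-dn", "site-token"]))
```

      Because the client's list is built from whatever mechanism libraries are installed, a new method can be deployed by adding a library rather than by redistributing the client.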

    3. When is authorization checked ?

      The current model is that authorization to receive a service is checked (only) at the time of the request. Since requests pass down a hierarchy of Grid Services to the lowest level Grid interface, functionally this means that authorization is checked by every Grid Service at its own discretion. Once a service request has been granted, is there reason to force rechecks of the authorization ? Again, this seems an item best left to the Grid Service provider. They could implement periodic rechecks at their discretion, but there seems no systematic reason to insist on it.

    4. How is authorization revoked ?

      The initial answer seems to be for authorized actions to be atomic and have no revocation method. For the purposes of handling tokens, this may be acceptable. However, it must be possible for an authorizer of an atomic action to kill the request. Consider the case of the user who submits 10 copies of a job, 9 long and one short test job. If examination of the test job indicates the code has a bug, the user may well want to kill the 9 long jobs even/especially if they are currently running. Is this to be accomplished by revoking the authorization of the existing jobs or by issuing a second request to every grid resource used asking to abort the previous one ?

    5. Can/should authorization be delegatable ?

      It may be acceptable for some requests to allow the requestor to delegate the authorization to perform the request. In this scenario, the holder of some authorization token would create a delegated authorization token specifying that it authorized the second entity to use its initial authorization. This would have to be checked for acceptability by the issuer of the delegated token and the resource accepting the token. Is this useful ?
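
      To make the question concrete, here is a sketch of the check a resource might apply to a delegated token. Everything in it is an assumption for illustration: the AuthzToken type, the rule that a delegated token may only narrow its parent's rights, and the delegatable flag set by the original issuer.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class AuthzToken:
    holder: str                       # entity entitled to use the token
    rights: frozenset                 # rights the token conveys
    issued_by: str                    # entity that created the token
    parent: Optional["AuthzToken"] = None
    delegatable: bool = False         # did the issuer permit further delegation?

def token_acceptable(token: AuthzToken, trusted_issuers: set) -> bool:
    # Walk from the delegated token back to the original one.
    while token.parent is not None:
        parent = token.parent
        if not parent.delegatable:            # issuer forbade delegation
            return False
        if token.issued_by != parent.holder:  # only the holder may delegate
            return False
        if not token.rights <= parent.rights: # delegation may only narrow rights
            return False
        token = parent
    return token.issued_by in trusted_issuers
```

      In this model the original issuer expresses its acceptance in advance (through delegatable), and the resource performs the remaining checks when the token is presented.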

    6. Is authorization information private ?

      Particularly in the case where authorization tokens are presented with the GSI certificate, is there reason to obscure the authorization information ? It would seem that exposure of the detailed list of rights (authorization tokens) a requestor might have would create a way of targeting privileged identities. Merely exposing the authorizing entity may not reveal too much information (though for authorizing entities dedicated to high value credentials, this would be sufficient).

      Since, in general, the requestor does not know the identity of the relying party, how would this be done ?

  3. Auditing

    1. Who is responsible for keeping what usage information ?

    2. Is reverse mapping from local identity to invoking Grid identity available for appropriate accounting, particularly in the case of mapping onto shared or transient local accounts ?

    3. Who defines what level of accounting is required and how ? (author needed)
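
      For item 2, a reverse mapping onto shared or transient local accounts has to be indexed by time, since the same local account may serve different Grid identities at different moments. A minimal sketch, with a hypothetical Lease record and lookup function:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional

@dataclass(frozen=True)
class Lease:
    """One period during which a local account acted for a Grid identity."""
    local_account: str
    grid_dn: str
    start: datetime
    end: datetime

def grid_identity_at(leases: Iterable[Lease], local_account: str,
                     when: datetime) -> Optional[str]:
    # Half-open interval [start, end) so back-to-back leases do not overlap.
    for lease in leases:
        if lease.local_account == local_account and lease.start <= when < lease.end:
            return lease.grid_dn
    return None  # no record: the action cannot be attributed
```

      Whoever keeps these records (item 1) must retain them at least as long as the accounting data they are needed to interpret.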



Last modified: Tue Dec 3 17:06:18 CST 2002