Checkpoints: Software Architecture Document
Topics
Overall, the system is soundly based architecturally, because:
- The architecture appears to be stable.
The need for stability is dictated by the nature of the Construction
phase: in Construction the project typically expands, adding developers who
will work in parallel, communicating loosely with other developers as they
produce the product. The degree of independence and parallelism needed in
Construction simply cannot be achieved if the architecture is not stable.
The importance of a stable architecture cannot be overstated. Do not be
deceived into thinking that 'pretty close is good enough' - unstable is
unstable, and it is better to get the architecture right and delay the onset
of Construction rather than proceed. The coordination problems involved in
trying to repair the architecture while developers are trying to build upon
its foundation will easily erase any apparent benefits of accelerating the
schedule. Changes to architecture during Construction have broad impact:
they tend to be expensive, disruptive and demoralizing.
The real difficulty of assessing architectural stability is that
"you don't know what you don't know"; stability is measured
relative to expected change. As a result, stability is essentially a
subjective measure. We can, however, base this subjectivity on more than
just conjecture. The architecture itself is developed by considering
'architecturally significant' scenarios - sub-sets of use cases which
represent the most technologically challenging behavior the system must
support. Assessing the stability of the architecture involves ensuring that
the architecture has broad coverage, to ensure that there will be no
'surprises' in the architecture going forward.
Past experience with the architecture can also be a good indicator: if
the rate of change in the architecture is low, and remains low as new
scenarios are covered, there is good reason to believe that the architecture
is stabilizing. Conversely, if each new scenario causes changes in the
architecture, it is still evolving and baselining is not yet warranted.
- The complexity of the system matches the functionality it provides.
- The conceptual complexity is appropriate given the skill and
experience of its:
- users
- operators
- developers
- The system has a single consistent, coherent architecture.
- The number and types of components are reasonable.
- The system has a consistent system-wide security
facility. All the security components work together to safeguard the
system.
- The system will meet its availability targets.
- The architecture will permit the system to be recovered in the
event of a failure within the required amount of time.
- The products and techniques on which the system is based match
its expected life.
- An interim (tactical) system with a short life can safely
be built using old technology because it will soon be discarded.
- A system with a long life expectancy (most systems) should
be built on up-to-date technology and methods so it can be maintained
and expanded to support future requirements.
- The architecture defines clear interfaces to enable
partitioning for parallel team development.
- The designer of a model element can understand enough from the
architecture to successfully design and develop the model element.
- The packaging approach reduces complexity and improves
understanding.
- Packages have been defined to be highly cohesive within the
package, while the packages themselves are loosely coupled.
- Similar solutions within the common application domain have
been considered.
- The proposed solution can be easily understood by someone
generally knowledgeable in the problem domain.
- All people on the team share the same view of the architecture
as the one presented by the software architect.
- The Software Architecture Document is current.
- The Design Guidelines have been followed.
- All technical risks have either been mitigated or been
addressed in a contingency plan. Newly discovered risks have been documented
and analyzed for their potential impact.
- The key performance requirements (established budgets) have
been satisfied.
- Test cases, test harnesses, and test configurations have been
identified.
- The architecture does not appear to be "over-designed".
- The mechanisms in place appear to be simple enough to use.
- The number of mechanisms is modest and consistent with the
scope of the system and the demands of the problem domain.
- All use-case realizations defined for the current iteration can
be executed by the architecture, as demonstrated by diagrams depicting:
- Interactions between objects,
- Interactions between tasks and processes,
- Interactions between physical nodes.
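The stability checkpoint at the top of this list suggests watching the rate
of architectural change as new scenarios are covered. A minimal sketch of
that indicator follows. The record type, the 5% threshold, and the
three-scenario window are illustrative assumptions, not part of the
checklist; a project would choose its own values.

```python
from dataclasses import dataclass


@dataclass
class ScenarioReview:
    """Hypothetical record of one architecturally significant scenario walkthrough."""
    scenario: str
    elements_changed: int  # architectural elements added, removed, or reworked
    elements_total: int    # architectural elements present after the review


def change_rates(reviews):
    """Fraction of the architecture that changed for each reviewed scenario."""
    return [r.elements_changed / r.elements_total for r in reviews]


def appears_stable(reviews, threshold=0.05, window=3):
    """True when each of the last `window` scenarios changed at most
    `threshold` of the architectural elements (both values are assumptions)."""
    if len(reviews) < window:
        return False  # too little history to judge
    return all(rate <= threshold for rate in change_rates(reviews)[-window:])


reviews = [
    ScenarioReview("Place order", 12, 40),
    ScenarioReview("Cancel order", 5, 44),
    ScenarioReview("Nightly batch", 1, 45),
    ScenarioReview("Failover", 2, 45),
    ScenarioReview("Audit export", 0, 45),
]
print(appears_stable(reviews))      # last three scenarios caused little change
print(appears_stable(reviews[:3]))  # early scenarios were still reshaping the design
```

A low number here supports the judgment but does not replace it: the
measure is only as good as the coverage of the scenarios that were reviewed.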
Overall
- Subsystem and package partitioning and layering are logically
consistent.
- All analysis mechanisms have been identified and described.
Subsystems
- The services (interfaces) of subsystems in upper-level layers
have been defined.
- The dependencies between subsystems and packages correspond
to dependency relationships between the contained classes.
- The classes in a subsystem support the services identified
for the subsystem.
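The second Subsystems checkpoint, that dependencies between subsystems and
packages correspond to dependencies between the contained classes, lends
itself to a mechanical check. The sketch below assumes the model can be
exported as a class-to-package mapping and a list of class-level
dependencies; the function names and sample data are hypothetical.

```python
def implied_package_dependencies(class_package, class_dependencies):
    """Derive package-level dependencies from class-level ones."""
    implied = set()
    for source, target in class_dependencies:
        src_pkg = class_package[source]
        dst_pkg = class_package[target]
        if src_pkg != dst_pkg:  # dependencies inside one package are not reported
            implied.add((src_pkg, dst_pkg))
    return implied


def compare_dependencies(declared, class_package, class_dependencies):
    """Report package dependencies that the classes need but the model omits
    ("undeclared"), and modeled ones that no class relies on ("unused")."""
    implied = implied_package_dependencies(class_package, class_dependencies)
    declared = set(declared)
    return {
        "undeclared": sorted(implied - declared),
        "unused": sorted(declared - implied),
    }


class_package = {
    "OrderForm": "ui",
    "OrderService": "services",
    "Order": "domain",
    "OrderRepository": "persistence",
}
class_dependencies = [
    ("OrderForm", "OrderService"),
    ("OrderService", "Order"),
    ("OrderService", "OrderRepository"),
    ("OrderRepository", "Order"),
]
declared = [("ui", "services"), ("services", "domain"), ("ui", "persistence")]

print(compare_dependencies(declared, class_package, class_dependencies))
```

Both result lists should be empty when the package diagram and the class
model agree; any entry is a prompt for review rather than proof of an error.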
Classes
- The key entity classes and their relationships have been
identified.
- Relationships between key entity classes have been defined.
- The name and description of each class clearly reflects the
role it plays.
- The description of each class accurately captures the
responsibilities of the class.
- The entity classes have been mapped to analysis mechanisms
where appropriate.
- The role names of aggregations and associations accurately
describe the relationship between the related classes.
- The multiplicities of the relationships are correct.
- The key entity classes and their relationships are consistent
with the business model (if it exists), domain model (if it exists),
requirements, and glossary entries.
- The model is at an appropriate level of detail given the
model objectives.
- For the business model, requirements model or the design
model during the elaboration phase, there is not an over-emphasis on
implementation issues.
- For the design model in the construction phase, there is a
good balance of functionality across the model elements, using composition
of relatively simple elements to build a more complex design.
- The model demonstrates familiarity and competence with the
full breadth of modeling concepts applicable to the problem domain;
modeling techniques are used appropriately for the problem at hand.
- Concepts are modeled in the simplest way possible.
- The model is easily evolved; expected changes can be easily
accommodated.
- At the same time, the model has not been overly structured to
handle unlikely change, at the expense of simplicity and
comprehensibility.
- The key assumptions behind the model are documented and
visible to reviewers of the model. If the assumptions are applicable to a
given iteration, then the model should be able to be evolved within those
assumptions, but not necessarily outside of those assumptions. Documenting
assumptions is a way of indemnifying designers from not looking at
"all" possible requirements. In an iterative process, it is
impossible to analyze all possible requirements, and to define a model
which will handle every future requirement.
- The purpose of the diagram is clearly stated and easily
understood.
- The graphical layout is clean and clearly conveys the
intended information.
- The diagram conveys just enough to accomplish its objective,
but no more.
- Encapsulation is effectively used to hide detail and improve
clarity.
- Abstraction is effectively used to hide detail and improve
clarity.
- Placement of model elements effectively conveys
relationships; similar or closely coupled elements are grouped together.
- Relationships among model elements are easy to understand.
- Labeling of model elements contributes to understanding.
- Each model element has a distinct purpose.
- There are no superfluous model elements; each one plays an
essential role in the system.
- For each error or exception, a policy defines how the system
is restored to a "normal" state.
- For each possible type of input error from the user or wrong
data from external systems, a policy defines how the system is restored to
a "normal" state.
- There is a consistently applied policy for handling
exceptional situations.
- There is a consistently applied policy for handling data
corruption in the database.
- There is a consistently applied policy for handling database
unavailability, including whether data can still be entered into the
system and stored later.
- If data is exchanged between systems, there is a policy for
how systems synchronize their views of the data.
- If the system utilizes redundant processors or nodes to
provide fault tolerance or high availability, there is a strategy for
ensuring that no two processors or nodes can 'think' that they are
primary, and that the system is never left with no primary processor or node.
- The failure modes for a distributed system have been
identified and strategies defined for handling the failures.
- The process for upgrading an existing system without loss of
data or operational capability is defined and has been tested.
- The process for converting data used by previous releases is
defined and has been tested.
- The amount of time and resources required to upgrade or
install the product is well-understood and documented.
- The functionality of the system can be activated one use case
at a time.
- Disk space can be reorganized or recovered while the system
is running.
- The responsibilities and procedures for system configuration
have been identified and documented.
- Access to the operating system or administration functions is
restricted.
- Licensing requirements are satisfied.
- Diagnostics routines can be run while the system is running.
- The system monitors operational performance itself (e.g.
capacity threshold, critical performance threshold, resource exhaustion).
- The actions taken when thresholds are reached are
defined.
- The alarm handling policy is defined.
- The alarm handling mechanism is defined and has been
prototyped and tested.
- The alarm handling mechanism can be 'tuned' to
prevent false or redundant alarms.
- The policies and procedures for network (LAN, WAN) monitoring
and administration are defined.
- Faults on the network can be isolated.
- There is an event tracing facility that can be enabled to aid in
troubleshooting.
- The overhead of the facility is understood.
- The administration staff possesses the knowledge to use
the facility effectively.
- It is not possible for a malicious user to:
- enter the system.
- destroy critical data.
- consume all resources.
- Performance requirements are reasonable and reflect real constraints
in the problem domain; their specification is not arbitrary.
- Estimates of system performance exist (modeled as necessary using a Workload
Analysis Model), and these indicate that the performance requirements are
not significant risks.
- System performance estimates have been validated using architectural prototypes,
especially for performance-critical requirements.
- Memory budgets for the application have been defined.
- Actions have been taken to detect and prevent memory leaks.
- There is a consistently applied policy defining how the
virtual memory system is used, monitored and tuned.
- The actual number of lines of code developed thus far agrees
with the estimated lines of code at the current milestone.
- The estimation assumptions have been reviewed and remain
valid.
- Cost and schedule estimates have been re-computed using the
most recent actual project experience and productivity performance.
- Portability requirements have been met.
- Programming Guidelines provide specific guidance on creating
portable code.
- Design Guidelines provide specific guidance on designing
portable applications.
- A 'test port' has been done to verify portability claims.
- Measures of quality (MTBF, number of outstanding defects,
etc.) have been met.
- The architecture provides for recovery in the event of
disaster or system failure.
- Security requirements have been met.
- Are the teams well-structured? Are responsibilities
well-partitioned between teams?
- Are there political, organizational or administrative issues
that restrict the effectiveness of the teams?
- Are there personality conflicts?
The Use-Case View section of the Software Architecture Document:
- each use case is architecturally significant, identified as
such because it:
- is vitally important to the customer
- motivates key elements in the other views
- is a driver for mitigating one or more major risks, including any
challenging non-functional requirements.
- there are no use cases whose architectural concerns are already
covered by another use case
- the architecturally significant aspects of the use case are clear, and
not lost in details
- the use case is clear and unlikely to change in a way that
affects the architecture, or there is a plan in place for how to achieve
such clarity and stability
- no architecturally significant use cases have been missed (may require
some analysis of the use cases not selected for this view).
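The check that no selected use case merely repeats the architectural
concerns of another can be approximated by comparing sets of concerns. The
sketch below assumes each use case has been tagged with the concerns it
exercises; the tags and the function name are hypothetical. It only detects
a use case covered by a single other use case, not one covered by several
use cases in combination.

```python
def redundant_use_cases(concerns_by_use_case):
    """Map each redundant use case to one use case that already covers it.

    A use case is treated as redundant when its concerns are a strict subset
    of another's, or identical to those of a use case listed earlier.
    """
    items = [(name, frozenset(c)) for name, c in concerns_by_use_case.items()]
    redundant = {}
    for i, (name, concerns) in enumerate(items):
        for j, (other, other_concerns) in enumerate(items):
            if i == j:
                continue
            strictly_covered = concerns < other_concerns
            duplicate_of_earlier = concerns == other_concerns and j < i
            if strictly_covered or duplicate_of_earlier:
                redundant[name] = other
                break
    return redundant


use_case_view = {
    "Place order": {"transactions", "persistence", "security"},
    "View order": {"persistence"},
    "Login": {"security", "session management"},
    "Cancel order": {"transactions", "persistence", "security"},
}
print(redundant_use_cases(use_case_view))
```

A flagged use case may still belong in the view for other reasons, such as
its importance to the customer, so the result is input to the review rather
than a verdict.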
The Logical View section of the Software Architecture Document:
- accurately and completely presents an overview of the architecturally
significant elements of the design.
- presents the complete set of architectural mechanisms used
in the design along with the rationale used in their selection.
- presents the layering of the design, along with the rationale
used to partition the layers.
- presents any frameworks or patterns used in the design, along
with the rationale used to select the patterns or frameworks.
- The number of architecturally significant model elements
is proportionate to the size and scope of the system, and is of a size
which still renders the major concepts at work in the system understandable.
Topics
- Potential race conditions (process competition for critical
resources) have been identified and avoidance and resolution strategies
have been defined.
- There is a defined strategy for handling "I/O queue
full" or "buffer full" conditions.
- The system monitors itself (capacity threshold, critical
performance threshold, resource exhaustion) and is capable of taking
corrective action when a problem is detected.
- Response time requirements for each message have been
identified.
- There is a diagnostic mode for the system which allows
message response times to be measured.
- The nominal and maximal performance requirements for
important operations have been specified.
- There is a set of performance tests capable of measuring
whether performance requirements have been met.
- The performance tests cover the "extra-normal"
behavior of the system (startup and shutdown, alternate and exceptional
flows of events of the use cases, system failure modes).
- Architectural weaknesses creating the potential for
performance bottlenecks have been identified. Particular emphasis has been
given to:
- Use of some finite shared resource such as (but not limited
to) semaphores, file handles, locks, latches, shared memory, etc.
- inter-process communication. Communication across process
boundaries is always more expensive than in-process communication.
- inter-processor communication. Communication across processor
boundaries is always more expensive than inter-process communication.
- physical and virtual memory usage; the point at which the
system runs out of physical memory and starts using virtual memory is a
point at which performance usually drops precipitously.
- Where there are primary and backup processes, the potential
for more than one process believing that it is primary (or no process
believing that it is primary) has been considered and specific design
actions have been taken to resolve the conflict.
- There are external processes that will restore the system to
a consistent state when an event like a process failure leaves the system
in an inconsistent state.
- The system is tolerant of errors and exceptions, such that when
an error or exception occurs, the system can revert to a consistent state.
- Diagnostic tests can be executed while the system is running.
- The system can be upgraded (hardware, software) while it is
running, if required.
- There is a consistent policy for handling alarms in the
system, and the policy has been consistently applied. The alarm policy
addresses:
- the "sensitivity" of the alarm reporting
mechanism;
- the prevention of false or redundant alarms;
- the training and user interface requirements of staff who
will use the alarm reporting mechanism.
- The performance impact (process cycles, memory, etc.) of the
alarm reporting mechanism has been assessed and falls within acceptable
performance thresholds as established in the performance requirements.
- The workload/performance requirements have been examined and
have been satisfied. In the case where the performance requirements are
unrealistic, they have been re-negotiated.
- Memory budgets, to the extent that they exist, have been
identified and the software has been verified to meet those requirements.
Measures have been taken to detect and prevent memory leaks.
- A policy exists for use of the virtual memory system,
including how to monitor and tune its usage.
- Processes are sufficiently independent of one another that
they can be distributed across processors or nodes when required.
- Processes which must remain co-located (because of
performance and throughput requirements, or the inter-process
communication mechanism (e.g. semaphores or shared memory)) have been
identified, and the impact of not being able to distribute this workload
has been taken into consideration.
- Messages which can be made asynchronous, so that they can be
processed when resources are more available, have been identified.
- The throughput requirements have been satisfied by the
distribution of processing across nodes, and potential performance
bottlenecks have been addressed.
- Where information is distributed and potentially replicated
across several nodes, information integrity is ensured.
- Requirements for reliable transport of messages, to the extent
that they exist, have been satisfied.
- Requirements for secure transport of messages, to the extent that
they exist, have been satisfied.
- Processing has been distributed across nodes in such a way
that network traffic and response time have been minimized subject to
consistency and resource constraints.
- System availability requirements, to the extent that they
exist, have been satisfied.
- The maximum system down-time in the event of a server or
network failure has been determined and is within acceptable limits as
defined by the requirements.
- Redundant and stand-by servers have been defined in such
a way that it is not possible for more than one server to be
designated as the "primary" server.
- All potential failure modes have been documented.
- Faults in the network can be isolated, diagnosed and
resolved.
- The amount of "headroom" in the CPU utilization has
been identified, and the method of measurement has been defined.
- There is a stated policy for the actions to be taken when the
maximum CPU utilization is exceeded.
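The last two checkpoints ask for a defined way of measuring CPU headroom
and a stated policy for exceeding the maximum. One possible shape is
sketched below. Measuring headroom against the peak sample, the 85% ceiling,
the 70% warning level, and the action names are all assumptions chosen for
illustration; utilization is expressed in whole percent.

```python
def cpu_headroom(samples_percent, ceiling_percent=85):
    """Headroom is the agreed ceiling minus the peak observed utilization."""
    if not samples_percent:
        raise ValueError("at least one utilization sample is required")
    return ceiling_percent - max(samples_percent)


def action_for(utilization_percent, warn_at=70, ceiling_percent=85):
    """Return the (hypothetical) policy action for one utilization reading."""
    if utilization_percent > ceiling_percent:
        return "shed-load"    # maximum exceeded: reject or defer new work
    if utilization_percent > warn_at:
        return "raise-alarm"  # approaching the maximum: notify operators
    return "none"


samples = [42, 61, 55]  # utilization readings over one measurement window
print(cpu_headroom(samples))
print([action_for(u) for u in (55, 78, 91)])
```

Whatever measurement is chosen (peak, average, or a percentile), the point
of the checkpoint is that it is written down and applied consistently.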