Network Working Group                                       Matt Mathis
INTERNET-DRAFT                         Pittsburgh Supercomputing Center
Expiration Date: Jan 1998                                     July 1997

		Empirical Bulk Transfer Capacity
		< draft-ietf-bmwg-ippm-treno-btc-01.txt >

Status of this Document

This document is an Internet-Draft.  Internet-Drafts are working documents
of the Internet Engineering Task Force (IETF), its areas, and its working
groups.  Note that other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and
may be updated, replaced, or obsoleted by other documents at any time.  It
is inappropriate to use Internet-Drafts as reference material or to cite
them other than as "work in progress."

To learn the current status of any Internet-Draft, please check the
"1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au
(Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West
Coast).

Abstract:

Bulk Transport Capacity (BTC) is a measure of a network's ability to
transfer significant quantities of data with a single congestion-aware
transport connection (e.g. state-of-the-art TCP).  For many applications
the BTC of the underlying network dominates the overall elapsed time for
the application, and thus dominates the performance as perceived by a user.
The BTC is a property of an IP cloud (links, routers, switches, etc.)
between a pair of hosts.  It does not include the hosts themselves (or
their transport-layer software).  However, congestion control is crucial to
the BTC metric because the Internet depends on the end systems to fairly
divide the available bandwidth on the basis of common congestion behavior.
The BTC metric is based on the performance of a reference congestion
control algorithm that has particularly uniform and stable behavior.

     This Internet-draft is likely to become one section of some
     future, larger document covering several different metrics.

Introduction:

Bulk Transport Capacity (BTC) is a measure of a network's ability to
transfer significant quantities of data with a single congestion-aware
transport connection (e.g. state-of-the-art TCP).  For many applications
the BTC of the underlying network dominates the overall elapsed time for
the application, and thus dominates the performance as perceived by a user.
Examples of such applications include FTP and other network copy
utilities.

The BTC is a property of an IP cloud (links, routers, switches, etc.)
between a pair of hosts.  It does not include the hosts themselves (or
their transport-layer software).  However, congestion control is crucial to
the BTC metric because the Internet depends on the end systems to fairly
divide the available bandwidth on the basis of common congestion behavior.

Four standard congestion control algorithms are described in RFC2001:
Slow-start, Congestion Avoidance, Fast Retransmit and Fast Recovery.  Of
these algorithms, Congestion Avoidance drives the steady-state bulk
transfer behavior of a reference TCP.  It calls for opening the congestion
window by 1 segment size on each round trip time, and closing it by 1/2 on
congestion, as signaled by lost segments.

Slow-start is part of TCP's transient behavior.  It is used to quickly
bring new or recently timed out connections up to an appropriate congestion
window.
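
As an illustration only, the window update rules described above can be
sketched in C as follows.  This is a simplified sketch of the RFC2001
rules (Fast Retransmit/Fast Recovery and timeouts are omitted), not
TReno's or any production TCP's actual code, and all names are
illustrative.

   /* Simplified sketch of the RFC2001 window update rules.
      All quantities are in bytes. */
   struct cc_state {
       unsigned long cwnd;       /* congestion window    */
       unsigned long ssthresh;   /* Slow-start threshold */
       unsigned long mss;        /* segment size         */
   };

   /* Called once per acknowledged segment. */
   void on_ack(struct cc_state *s)
   {
       if (s->cwnd < s->ssthresh)
           /* Slow-start: +1 segment per ACK, doubling cwnd each RTT */
           s->cwnd += s->mss;
       else
           /* Congestion Avoidance: about +1 segment per round trip */
           s->cwnd += s->mss * s->mss / s->cwnd;
   }

   /* Called once per congestion signal (lost segment). */
   void on_congestion(struct cc_state *s)
   {
       s->ssthresh = s->cwnd / 2;           /* close window by 1/2 */
       if (s->ssthresh < 2 * s->mss)
           s->ssthresh = 2 * s->mss;        /* but at least 2 segments */
       s->cwnd = s->ssthresh;
   }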

In Reno TCP, Fast Retransmit and Fast Recovery are used to support the
Congestion Avoidance algorithm during recovery from lost segments.

During the recovery interval the data receiver sends duplicate
acknowledgements, which the data sender must use to identify missing
segments as well as to estimate the quantity of outstanding data in the
network.  The research community has observed unpredictable or unstable TCP
performance caused by errors and uncertainties in the estimation of
outstanding data [Lakshman94, Floyd95, Hoe95].  Simulations of reference
TCP implementations have uncovered situations where incidental changes in
other parts of the network have a large effect on performance [Mathis96].
Other simulations have shown that under some conditions, slightly better
networks (higher bandwidth or lower delay) yield lower throughput [This is
easy to construct, but has it been published?].  As a consequence, even
reference TCP implementations do not make good metrics.

Furthermore, many TCP implementations in production use in the Internet
today have outright bugs which can have arbitrary and unpredictable effects
on performance [Comer94, Brakmo95, Paxson97a, Paxson97b].

The difficulties with using TCP for measurement can be overcome by using
the Congestion Avoidance algorithm by itself, in isolation from other
algorithms.  In [Mathis97] it is shown that the performance of the
Congestion Avoidance algorithm can be predicted by a simple analytical
model.  The model was derived in [Ott96a, Ott96b].  The model predicts the
performance of the Congestion Avoidance algorithm as a function of the
round trip time, the TCP segment size and the probability of receiving a
congestion signal (i.e. packet loss).  The paper shows that the model
accurately predicts the performance of TCP using the SACK option [RFC2018]
under a wide range of conditions.  If losses are isolated (no more than one
per round trip) then Fast Recovery successfully estimates the actual
congestion window during recovery, and Reno TCP also fits the model.
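
For reference, the model can be written in the form used in [Mathis97]:

      BW = ( MSS / RTT ) * ( C / sqrt(p) )

where BW is the average bandwidth of the connection, MSS is the segment
size, RTT is the round trip time, p is the probability of receiving a
congestion signal, and C is a constant on the order of 1 whose exact
value depends on the loss pattern and the acknowledgment strategy.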

This version of the BTC metric is based on the TReno ("tree-no")
diagnostic, which implements a protocol-independent version of the
Congestion Avoidance algorithm.  TReno's internal protocol is designed to
accurately implement the Congestion Avoidance algorithm under a very wide
range of conditions, and to diagnose timeouts when they interrupt
Congestion Avoidance.  In [Mathis97] it is observed that TReno fits the
same performance model as SACK and Reno TCPs.  [Although the paper was
written using an older version of TReno, which has less accurate internal
measurements.]

Implementing the Congestion Avoidance algorithm within a diagnostic tool
eliminates calibration problems associated with the non-uniformity of
current TCP implementations.  However, like all empirical metrics, it
introduces new problems, most notably the need to certify the correctness
of the implementation and to verify that there are not systematic errors
due to limitations of the tester.

TReno implements the Congestion Avoidance algorithm over either
traceroute-style UDP and ICMP messages or ICMP ping packets.

Many of the calibration checks can be included in the measurement process
itself.  The TReno program includes error and warning messages for many
conditions that indicate either problems with the infrastructure or in some
cases problems with the measurement process.  Other checks need to be
performed manually.

Metric Name: TReno-Type-P-Bulk-Transfer-Capacity
(e.g. TReno-UDP-BTC)

Metric Parameters: A pair of IP addresses, Src (aka "tester")
and Dst (aka "target"), a start time T and initial MTU.

     [The framework document needs a general way to address additional
     constraints that may be applied to metrics: E.g. for a
     NetNow-style test between hosts on two exchange points, some
     indication of/control over the first hop is needed.]

Definition: The average data rate attained by the Congestion Avoidance
algorithm, while using type-P packets to probe the forward (Src to Dst)
path.  In the case of ICMP ping, these messages also probe the return
path.

Metric Units: bits per second

Ancillary results:
* Statistics over the entire test (data transferred, duration and average
rate)
* Statistics over the Congestion Avoidance portion of the test (data
transferred, duration and average rate)
* Path property statistics (MTU, Min RTT, max cwnd during Congestion
Avoidance and max cwnd during Slow-start)
* Direct measures of the analytic model parameters (number of congestion
signals, average RTT)
* Indications of which TCP algorithms must be present to attain the same
performance.
* The estimated load/BW/buffering used on the return path.
* Warnings about data transmission abnormalities.  (e.g. packets
out-of-order, events that cause timeouts)
* Warnings about conditions which may affect metric accuracy.  (e.g.
insufficient tester buffering)
* Alarms about serious data transmission abnormalities.  (e.g. data
duplicated in the network)
* Alarms about tester internal inconsistencies and events which might
invalidate the results.
* IP address/name of the responding target.
* TReno version.

Method: Run the TReno program on the tester with the chosen packet type
addressed to the target.  Record both the BTC and the ancillary results.

Manual calibration checks:  (See detailed explanations below.)
* Verify that the tester and target have sufficient raw bandwidth to
sustain the test.
* Verify that the tester and target have sufficient buffering to support
the window needed by the test.
* Verify that there is not any other system activity on the tester or
target.
* Verify that the return path is not a bottleneck at the load needed to
sustain the test.
* Verify that the IP address reported in the replies is an appropriate
interface of the selected target.

Version control:
* Record the precise TReno version (-V switch).
* Record the precise tester OS version, CPU version and speed, interface
type and version.

Discussion:

Note that the BTC metric is defined specifically to be the average data
rate between the source and destination hosts.  The ancillary results are
designed to detect possible measurement problems, and to help diagnose the
network.  The ancillary results should not be used as metrics in their own
right.

The current version of TReno does not include an accurate model for TCP
timeouts or their effect on average throughput.  TReno takes the view that
timeouts reflect an abnormality in the network, and should be diagnosed as
such.

There are many possible reasons why a TReno measurement might not agree
with the performance obtained by a TCP-based application.  Some key ones
include: older TCPs missing key algorithms such as MTU discovery, support
for large windows or SACK, or mis-tuning of either the data source or sink.

Some network conditions which require the newer TCP algorithms are detected
by TReno and reported in the ancillary results.  Other documents will cover
methods to diagnose the difference between TReno and TCP performance.

It would raise the accuracy of TReno's traceroute mode if the ICMP "TTL
exceeded" messages were generated at the target and transmitted along the
return path with elevated priority (reduced losses and queuing delays).

People using the TReno metric as part of procurement documents should be
aware that in many circumstances MTU has an intrinsic and large impact on
overall path performance.  Under some conditions the difficulty in meeting
a given performance specification is inversely proportional to the square
of the path MTU.  (e.g. Halving the specified MTU makes meeting the
bandwidth specification 4 times harder.)
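
This follows from the model given in the introduction: solving

      BW = ( MSS / RTT ) * ( C / sqrt(p) )

for the congestion signal probability needed to sustain a given bandwidth
yields

      p = ( ( C * MSS ) / ( RTT * BW ) )^2

so halving the MTU (and with it the MSS) divides the tolerable loss rate
by four.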

When used as an end-to-end metric, TReno presents exactly the same load to
the network as a properly tuned state-of-the-art bulk TCP stream between
the same pair of hosts.  Although the connection is not transferring useful
data, it is no more wasteful than fetching an unwanted web page with the
same transfer time.

Calibration checks:

The following discussion assumes that the TReno diagnostic is implemented
as a user mode program running under a standard operating system.  Other
implementations, such as those in dedicated measurement instruments, can
have stronger built-in calibration checks.

The raw performance (bandwidth) limitations of both the tester and target
should be measured by running TReno in a controlled environment (e.g. a
bench test).  Ideally the observed performance limits should be validated
by diagnosing the nature of the bottleneck and verifying that it agrees
with other benchmarks of the tester and target (e.g. that TReno performance
agrees with direct measures of backplane or memory bandwidth or other
bottleneck as appropriate).  These raw performance limitations may be
obtained in advance and recorded for later reference.  Currently no routers
are reliable targets, although under some conditions they can be used for
meaningful measurements.  When testing between a pair of modern computer
systems at a few megabits per second or less, the tester and target are
unlikely to be the bottleneck.  TReno may not be accurate, and should not
be used as a formal metric, at rates above half of the known tester or
target limits.  This is because during the initial Slow-start TReno needs
to be able to send bursts which are twice the average data rate.

Likewise, if the first hop LAN is not more than twice as fast as the entire
path, some of the path properties, such as max cwnd during Slow-start, may
reflect the tester's link interface, and not the path itself.

Verifying that the tester and target have sufficient buffering is
difficult.  If they do not have sufficient buffer space, then losses at
their own queues may contribute to the apparent losses along the path.
There are several difficulties in verifying the tester and target buffer
capacity.  First, there are no good tests of the target's buffer capacity
at all.  Second, all validation of the tester's buffering depends in some
way on the accuracy of reports by the tester's own operating system.
Third, there is the confusing result that in many circumstances
(particularly when there is much more than sufficient average tester
performance) insufficient buffering in the tester does not adversely
impact measured performance.

TReno reports (as calibration alarms) any events in which transmit packets
were refused due to insufficient buffer space.  It reports a warning if the
maximum measured congestion window is larger than the reported buffer
space.  Although these checks are likely to be sufficient in most cases,
they are probably not sufficient in all cases, and will be the subject of
future research.

Note that on a timesharing or multi-tasking system, other activity on the
tester introduces burstiness due to operating system scheduler latency.
Since some queuing disciplines discriminate against bursty sources, it is
very important that there be no other system activity during a test.  This
should be confirmed with other operating system specific tools.

In ICMP mode TReno measures the net effect of both the forward and return
paths on a single data stream.  Bottlenecks and packet losses in the
forward and return paths are treated equally.

In traceroute mode, TReno computes and reports the load it contributes to
the return path.  Unlike real TCP, TReno can not distinguish between losses
on the forward and return paths, so ideally we want the return path to
introduce as little loss as possible.  A good way to test if the return
path has a large effect on a measurement is to reduce the forward path
messages down to ACK size (40 bytes), and verify that the measured packet
rate is improved by at least a factor of two.  [More research needed]

References

[Brakmo95] Brakmo, L., Peterson, L., "Performance problems in BSD4.4 TCP",
October 1995.

[Comer94] Comer, D., Lin, J., "Probing TCP Implementations", USENIX Summer
1994, June 1994.

[Floyd95] Floyd, S., "TCP and successive fast retransmits", February 1995,
Obtain via ftp://ftp.ee.lbl.gov/papers/fastretrans.ps.

[Hoe95] Hoe, J., "Startup dynamics of TCP's congestion control and
avoidance schemes".  Master's thesis, Massachusetts Institute of
Technology, June 1995.

[Jacobson88] Jacobson, V., "Congestion Avoidance and Control", Proceedings
of SIGCOMM '88, Stanford, CA., August 1988.

[Mathis96] Mathis, M., Mahdavi, J., "Forward acknowledgment: Refining TCP
congestion control", Proceedings of ACM SIGCOMM '96, Stanford, CA., August
1996.

[RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP Selective
Acknowledgment Options", 1996.  Obtain via:
ftp://ds.internic.net/rfc/rfc2018.txt

[Mathis97] Mathis, M., Semke, J., Mahdavi, J., Ott, T., "The Macroscopic
Behavior of the TCP Congestion Avoidance Algorithm", Computer
Communications Review, 27(3), July 1997.

[Ott96a] Ott, T., Kemperman, J., Mathis, M., "The Stationary
Behavior of Ideal TCP Congestion Avoidance", In progress, August
1996. Obtain via pub/tjo/TCPwindow.ps using anonymous ftp to
ftp.bellcore.com

[Ott96b] Ott, T., Kemperman, J., Mathis, M., "Window Size Behavior in
TCP/IP with Constant Loss Probability", DIMACS Special Year on Networks,
Workshop on Performance of Real-Time Applications on the Internet, Nov
1996.

[Paxson97a] Paxson, V., "Automated Packet Trace Analysis of TCP
Implementations", Proceedings of ACM SIGCOMM '97, August 1997.

[Paxson97b] Paxson, V., editor, "Known TCP Implementation Problems",
Work in progress: http://reality.sgi.com/sca/tcp-impl/prob-01.txt

[Stevens94] Stevens, W., "TCP/IP Illustrated, Volume 1: The Protocols",
Addison-Wesley, 1994.

[RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance,
Fast Retransmit, and Fast Recovery Algorithms",
ftp://ds.internic.net/rfc/rfc2001.txt

Author's Address
     Matt Mathis
     email: mathis@psc.edu
     Pittsburgh Supercomputing Center
     4400 Fifth Ave.
     Pittsburgh PA 15213

  ----------------------------------------------------------------
  Appendix A:

     Currently the best existing description of the algorithm is in
     the "FACK technical note" below http://www.psc.edu/networking/tcp.html.
     Within TReno, all invocations of "bounding parameters" will be
     reported as warnings.

     The FACK technical note will be revised for TReno, supplemented by a
     code fragment and included here.