
ANSWERS Autonomous Nanoelectronic Systems With Extended Replication and Signalling

Report reference: ANSW_1/FEB99
Date: 12/02/99

Algorithms and Architectures for use with Nanoelectronic Computers: 1

Author: MRB Forshaw, University College London, UK
(Appendix: D. Berzon & M. Forshaw)

Partners:
University College London (UCL)
Technische Universiteit Delft (TUD)
Universita degli Studi di Pisa (DIIET)
Universität Dortmund (UNIDO-LBE)

12/02/99

ANSWERS REPORT: ANSW_1/FEB99

Contents

Introduction
Section 1 : Background
  1.1. Efficiency
  1.2. Algorithmic complexity
Section 2 : Digital computing prospects
  2.1. The hardware performance envelope
    2.1.1. Device packing density
    2.1.2. Clock distribution
    2.1.3. Signal distribution
  2.2. Algorithms for digital machines
  2.3. Fault tolerance in digital logic circuits
Section 3 : Analogue/probabilistic computing prospects
  3.1. Algorithms for analogue machines
Section 4 : Hardware dependence: implementing algorithms with RTDs, SETs, QCAs and other devices
  4.1. RTDs
  4.2. SETs
  4.3. QCAs
Section 5 : Summary and conclusions
Appendix A: An electrostatic model energy evaluation of simple QCA circuits
Appendix B: The code


Introduction

This document is the first deliverable item in the ANSWERS project, which is being carried out under the EC-funded MELARI Initiative as Project 28667. The aims of this project are to examine, and to try to answer, some of the problems that will arise when new, ever-smaller devices are used in computers. Existing computers are based almost entirely on CMOS technology, but it is known that CMOS devices will encounter major problems as they are made ever smaller. A number of alternatives have been proposed: these include the use of resonant tunnelling diodes (RTDs), single-electron devices (SETs) and quantum cellular automata (QCAs). Another alternative, namely rapid single flux quantum devices (RSFQs), is being investigated in another MELARI-funded project. Other devices have been proposed; some of these are considered briefly later on in this document.

The acronym ‘ANSWERS’ stands for ‘Autonomous Nanoelectronic Systems with Extended Replication and Signalling’. Although a variety of possible algorithms and architectures will be examined in this project, it is likely that the dominant types of nanoelectronic systems will eventually include large, extended arrays of identical, relatively small, processing elements: hence the word ‘Replication’. These elements might be:

• microprocessors, using relatively conventional digital logic;

• simple correlators or adder/multiplier units, packed together as closely as possible in order to provide raw number-crunching power for high-speed digital signal processing or database searching;

• multiply-redundant bit-serial processing elements using a variety of fault-tolerant strategies;

• memory cells;

• ‘neurons’ or other non-linear add-and-threshold devices, acting in either an analogue or probabilistic manner; or possibly

• devices using quantum logic (as opposed to devices such as QCAs, which use quantum mechanical phenomena to implement relatively conventional digital logic processing).

The word ‘Autonomous’ in ANSWERS implies that the individual elements may operate almost independently of one another, only passing relatively small amounts of information between each other. This is not necessarily a desirable feature - some applications require large amounts of data to be passed synchronously and at high speed between processing elements. However, there are limits to the maximum speed of signal propagation over long distances. The ultimate limit is the speed of light, but a much more immediate constraint is the diffusive nature of electron propagation down aluminium or copper tracks. The further a signal has to be propagated, the harder it is to maintain synchronism between the transmitter and the receiver. Note that a ‘long distance’ need not be as large as 10 or 15 mm, which is the width of existing chips. Even existing processor designs have problems in sending high-speed signals synchronously over more than about 1 mm [10]. In the future, a ‘long distance’ may be as little as 0.01 mm (10000 nm). If processing elements cannot send synchronous signals to one another then they must work independently (autonomously).

Of the three devices which are being examined in the ANSWERS project, resonant tunnelling devices are by far the most advanced. Although they rely on quantum mechanical tunnelling for operation, are relatively hard to make using conventional silicon technology, and have other fundamental operational differences from CMOS, they can be made now (in small or moderate quantities). They are therefore a potential replacement for CMOS. Apart from the need for Europe to develop the necessary technology to use these devices, it is important to find out how they might eventually be used in very large numbers as device miniaturisation develops. In the ANSWERS project, the device technology of RTDs is being examined by the University of Dortmund (UNIDO). UNIDO are working in collaboration with the University of Duisburg, who are fabricating RTDs as part of the LOCOM project.

Single electron devices are already being used as highly precise current references, but the investigation of their behaviour as components in computers is less advanced. One major problem is that, although they can be made relatively easily using current silicon technology, their size is then such that they must be operated at very low temperatures. The smaller the size, the higher the operating temperature, but operation at room temperature requires very small devices, with critical feature dimensions of the order of a few nanometres or less. Prototype SETs have been made to work at room temperature, but there is a long way to go before they can be made in quantity, and made to work reliably at room temperature [71]. Another major problem facing SETs is their sensitivity to buried charge. These are trapped charges that are almost inevitably present near defects in any material which has a less-than-perfect crystalline structure. To complicate things further, these charges are not static, but can hop from site to site with a range of time scales. SETs are probably the first devices to encounter this problem seriously, but any semiconductor device working with very small numbers of electrons will have to overcome it. SETs can therefore be thought of as equal to RTDs in the race for nanoscale operation, because when RTDs are made with nanometre scale features, they too will face almost exactly the same problems as SETs. In the ANSWERS project, the properties of SET devices are being investigated at the Technische Universiteit Delft.

Quantum cellular automata are the most speculative of the devices being investigated in the ANSWERS project; they are being investigated by DIIET at the University of Pisa.
RTDs rely on the quasi-simultaneous quantum mechanical tunnelling of many electrons: they are ‘large-current’ devices, at least at present. SETs rely on sequential tunnelling of individual electrons under the control of one or more gates: they are ‘small-current’ devices. QCAs, however, rely on the correlated tunnelling of only two electrons between pairs of quantum wells under the control of an external gate. Because the electrons never leave an individual QCA, the transmission of information from one QCA to another is by electrostatic forces. Thus QCAs are ‘zero-current’ devices. Given the small numbers of electrons involved, and the dimensional accuracy which is needed for them to operate reliably, it is evident that there are many problems to be overcome before QCAs can be considered as major candidates for use in nanoelectronic computers.

Some people are almost completely dismissive of QCAs, or indeed SETs, as potentially viable devices. This is a somewhat intemperate view. SETs have been made to run at room temperature, and QCA devices have already been made, admittedly running below liquid helium temperatures. However, a more forceful rebuttal to such criticism can be based on the fact that trillions upon trillions of elements already operate successfully at room temperature, whose operation is based on the reliable movement of single electrons from one part of a system to another, sometimes using quantum tunnelling. Such elements are the molecules in any living creature (or virus). Although the designs of QCAs and SETs as presently envisaged may have to be modified as knowledge improves, it would be ill-advised for anyone to dismiss them outright.

Although progress is being made in the development of devices such as RTDs, SETs and QCAs, it is important to anticipate what problems might arise when enormous numbers of such devices are assembled on a chip.
Existing CMOS chip designs are already encountering problems with propagating signals from one side of a chip to another. Ever-smaller, ever faster devices are more expensive to manufacture, and power dissipation and device reliability problems are becoming harder to overcome. It was, of course, the anticipation of these problems that led to the writing of the first Semiconductor Industry Association (SIA) Road Map in 1994, with subsequent updated versions. These problems are not unique to the development of CMOS devices. Any assembly of semiconductor nanoscale devices, whatever their basic design principles, will have to face these problems.


The nature of these problems is such that it will soon be impossible to continue with existing designs of computer architecture at nanometre scales. If future ‘Pentiums’ or ‘DEC Alphas’ cannot be made ever smaller and faster, with ever-increasing on-chip memory capacity, then alternative architectures will be needed. In turn, alternative algorithms will be needed. It is this area which UCL is dealing with in the ANSWERS project.

Description of document – sections

This document is organised as follows: Section 1 provides a brief review of various aspects of algorithm and architecture design, with discussions of algorithmic efficiency and complexity.

Section 2 considers the prospects for computers based on digital logic. Some aspects of the ‘performance envelope’ for hardware are considered in section 2.1, mainly the interlinked limits between speed, component density and track widths. The analysis in this section is intended to apply to any type of device which uses conventional ‘wires’ (tracks) to connect individual devices (QCAs are the odd one out, because they do not use wires). Sections 2.2 and 2.3 consider algorithms for digital machines and fault tolerance respectively.

Section 3 considers the prospects for analogue/probabilistic computing. Neural networks fall into this category, but many existing neural network designs use digital techniques, or have only been implemented on conventional digital computers, so there is a strong overlap with digital logic. For example, some neural network designs are based on digital cellular automata. This section contains a brief discussion of algorithms for analogue/probabilistic computers.

Section 4 discusses specific examples of algorithms and architectures which might be appropriate test vehicles for RTDs, SETs and QCAs, to be examined during the ANSWERS project. The section on QCAs briefly considers a recently proposed architecture which is intended as a quantum computer.

The document finishes with a list of topics for further investigation, a brief bibliography, and an appendix containing the results of some work carried out at UCL on signal propagation down QCA ‘wires’.

Much of this document is intended only to remind the reader that there are many, many factors which must be taken into account when considering algorithms and their suitability for implementation on a particular computer or class of computer.
Inevitably, nearly all the examples relate implicitly to the current ‘CMOS digital computing’ era. It may well be the case that post-CMOS algorithms will have their own, different problems and peculiarities. In addition to this document, another document has been prepared by the ANSWERS partners, namely ‘Software Interface Definition’. This is mainly concerned with outlining the simulator software which the Universities of Pisa, Delft and Dortmund are preparing, but it also provides brief technical descriptions of the current status of each of the four partners. Both of these documents will shortly be made available on the UCL website (http://ipga.phys.ucl.ac.uk/reports/html) and on the ANSWERS home page (http://wwwbe.e-technik.uni-dortmund.de/~pacha/answers/answers1.html).


Section 1 : Background

An algorithm is a logical and/or mathematical recipe for solving a problem. Here we assume that an algorithm is formulated as a computer program, which in turn runs on some form of computer hardware. Nearly all hardware today is general-purpose, even if, like microprocessors in washing machines, many computers run only one algorithm for the whole of their working lives. The term ‘general purpose computer’ is merely a toned-down version of Turing’s ‘universal computing engine’. Such computers can in principle solve any problem, given enough time and memory. Even though limitless time and memory are never available, quite small machines are capable of solving an immense number of different possible problems, provided that someone has devised a suitable algorithm and turned the algorithm into executable code.

However, there are good ways and bad ways to write algorithms or programs (the terms are almost interchangeable) for any given type of machine. A ‘good’ algorithm is usually thought of as being efficient in terms of speed, memory requirements, or maintainability of the code and its ease of use. A ‘really good’ algorithm will be both efficient and also conceptually elegant.

So far the vast majority of algorithms have been written for von Neumann machines - single-processor, serial-instruction computers. Low-level parallelism in hardware for such machines (instruction pipelining and look-ahead, parallel integer and floating point arithmetic units etc.) allows algorithm implementation to be speeded up, but this is hidden from the user. An increasing number of low-end commercial machines have two, four or more processors - compilers for such machines can usually handle parallel task allocation, but handcrafting is often still used.

There has long been a tendency for designers of computer languages to favour conceptual elegance and to ignore possible inefficiencies of implementation.
For example, the LISP interpreted language of some years ago was extraordinarily elegant to use in certain applications, but was also often extraordinarily slow. At a much lower level, it is well-known that program loops, where some simple instructions are repeated many times, can be ‘unrolled’ with consequent speed benefits. This ‘loop unrolling’ lies at the heart of most existing microprocessor designs, where look-ahead instructions are carried out, but it can also be used on a larger scale, at the program level. The problem with such speedup strategies, which are popular with some groups in the programming fraternity, is that they sometimes make the resulting program much harder for most humans to understand. The general consensus is that it is better to make the task of computer programmers as easy as possible, even if this sometimes results in significant inefficiency in the use of hardware. This approach has been very successful so far, precisely because hardware performance has been increasing in accordance with Moore’s law. It is too early to say whether a possible failure of Moore’s law will have any impact on future programming styles.

Many computers have been built which have very large numbers of processors working in parallel. The term ‘very large’ used to mean anything from 32 powerful processors to more than 10,000 single-bit processors. Nowadays ‘very large’ usually implies a few hundred to a few thousand commercial processors (such as Pentiums or DEC Alphas), working in parallel: the current record-holder is the 9000-processor ASCI Red at Sandia, with a speed of more than 1 Teraflops (10^12 flops). Large-scale parallel processing engines have been available for 25 years or more, but have so far never occupied more than a niche market. This situation is likely to change when nanoscale electronics becomes feasible.
The ‘best’ algorithms for such multi-processor machines are either perfect or near-perfect matches between the structure of the algorithm and the hardware layout of the machine. One application-specific example is the use of a ‘butterfly’ hardware architecture to solve fast Fourier transforms (FFTs). The Connection Machine [31] was designed as a general-purpose machine, but the hypercube connectivity of the hardware was such that FFT algorithms ran with great efficiency. A different example is CLIP4, which, with a two-dimensional rectangular grid of processors, was designed for low-level processing of images, with one pixel per processor [14] [18]. Yet another example is where the solution of a prime number factorisation problem in cryptography was farmed out to hundreds of PCs and workstations, with each machine working on a part of the problem. Here the problem could be decomposed into a large number of small, independent tasks.

Many algorithms do not run with great efficiency on machines with large numbers of processors. This often does not matter - what may be more important is that some sort of speedup can be achieved in comparison with running the problem on a machine with one, or a few, processors. Alas, sometimes this speedup is very poor. This may be because the programmer has written bad code. More often, the poor speedup is because the algorithm is inherently unsuitable for mapping onto the particular array (for example trying to map an FFT algorithm onto a 2D-mesh array). Equally often, the poor speedup is because large amounts of data have to be transferred from one processing element to another - the input/output (I/O) problem - which overloads the interconnections. Yet again, the algorithm may be inherently serial in nature and only a modest degree of parallelism may be possible: many real-time plant control problems fall into this category [35].

A recent example of an attempt to develop a reconfigurable hardware system using field programmable gate arrays (FPGAs) produced speedups (over a single-processor SparcStation 10) of ~1 for binary heap (with 19 - 64 FPGAs), 6 - 12 for bubble sort (with 64 - 319 FPGAs), 41 for integer matrix multiply (with 43 FPGAs), and up to 398 for transitive closure (with 48 FPGAs) [67].
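The speedups quoted above from [67] can be turned into a rough speedup-per-FPGA figure. The Python sketch below is only an illustration: the text gives ranges without saying which speedup belongs to which FPGA count, so the sketch reports lower and upper bounds only, and it treats the "~1" for binary heap as exactly 1. The names `Result` and `efficiency_bounds` are hypothetical and are not taken from [67].

```python
from typing import NamedTuple, Tuple


class Result(NamedTuple):
    name: str
    speedup: Tuple[float, float]  # (min, max) speedup over one SparcStation 10
    fpgas: Tuple[int, int]        # (min, max) number of FPGAs used


# Figures as quoted in the text from [67]; "~1" is taken as 1.0 (assumption).
RESULTS = [
    Result("binary heap", (1.0, 1.0), (19, 64)),
    Result("bubble sort", (6.0, 12.0), (64, 319)),
    Result("integer matrix multiply", (41.0, 41.0), (43, 43)),
    Result("transitive closure", (398.0, 398.0), (48, 48)),
]


def efficiency_bounds(r: Result) -> Tuple[float, float]:
    """Return the (lowest, highest) possible speedup per FPGA for one result."""
    lowest = r.speedup[0] / r.fpgas[1]   # smallest speedup, most devices
    highest = r.speedup[1] / r.fpgas[0]  # largest speedup, fewest devices
    return lowest, highest


if __name__ == "__main__":
    for r in RESULTS:
        low, high = efficiency_bounds(r)
        print(f"{r.name:24s} speedup per FPGA: {low:.3f} to {high:.3f}")
```

On these figures the speedup per FPGA is below 0.1 for binary heap and about 8 for transitive closure, so it is clearly not a fixed fraction of the device count. An FPGA is not directly comparable to the reference workstation, which is why a value above 1 is possible.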
We present these figures, not because this machine is much better or worse than other multiprocessor systems (apart from the advantage of reconfigurability), but simply to emphasise that efficiency (measured as speedup relative to a single-processor machine) is rarely equal to the degree of parallelism, or even linearly proportional to the parallelism.

There is often no clear-cut distinction between an algorithm and the computer structure that is used to implement that algorithm. A fuller discussion of the multiplicity of layers, either explicit or implicit, which are incorporated in any real software-hardware combination will be given in later ANSWERS documents. Here we merely note that, for example, the source code which is needed to run an arbitrary algorithm on a single-processor machine will usually have to be significantly different if it is to be run on a multiprocessor machine. If the same source code can be compiled and run on two such different machines, then there will have to be two different compilers, one for each machine. There will therefore be two different software layers and hence, implicitly, a difference in the algorithm between the two machines. At the other extreme, and far more readily visible, is the implementation of specialised digital signal processing (DSP) chips which represent the embodiment of a single algorithm, or a small class of algorithms, in highly efficient dedicated hardware. For such devices the hardware is the algorithm.

Predicting the future is a notoriously error-prone activity, as many famous individuals and would-be Delphi oracular groups have shown over the years. However, in the context of software and hardware development, it is possible to make some realistic extrapolations from what exists today, even if unexpected developments cannot be predicted. The SIA Road Map covers the development of CMOS hardware over the next ten years.
However, one should note that this is not so much a prediction of future development as a specification of what is needed if Moore’s law is to be followed. It does not necessarily follow that the milestones in the SIA Road Map will be achieved on time, but it does at least provide a guideline. There are few such clear-cut guidelines for software, but one may predict with some certainty that, for the next dozen years or more, the vast majority of software-hardware installations will (in numerical terms) continue to be more-or-less well-known algorithms, implemented on more-or-less standard digital logic hardware. What will probably be different about such systems will be their speed and size, and the elegance or friendliness of the user interface. In addition, there will be a progressive increase in the number of ‘probabilistic’ or ‘analogue’ systems, not using Boolean logic. Instead, they will probably use algorithms which are extensions of existing probabilistic algorithms, which will be adapted and extended to run on whatever hardware platforms are available. It is therefore important to try to specify the future ‘performance envelope’ for both digital logic and analogue hardware/software combinations.

1.1. Efficiency

There have been, and are, numerous measures of algorithmic efficiency. One measure of efficiency is algorithmic complexity, which is discussed in 1.2. However, the most popular measure of efficiency is speed. Speed is, of course, not necessarily related to efficiency. Speed ratings for different machines are readily available in popular computer magazines and ‘Fastest 500’ web sites. It is sometimes forgotten that these hardware speed ratings are measured by running one or more sets of algorithms. A related, enduringly popular measure of the efficiency of a hardware-software combination is the ‘benchmark’. Dozens of benchmarks exist for low end and high-end machines, from games playing on PCs to scientific problem solving on processor farms (e.g. BENCHWEB, SPEC, NAS Parallel Benchmarks etc.). These mostly measure how fast a given machine can handle various kinds of algorithm which are taken to be representative of a particular class of problem. We note in passing that one efficiency measure that is sometimes ignored is that of code maintainability and robustness. Numerous examples could be given of ‘spaghetti code’, written long ago, and patched and re-patched over the years for some commercial (or academic) application, which cannot be discarded because the vital application cannot be shut down long enough for the code to be rewritten (or else no money is available to rewrite it). Unfortunately, even more examples could be given of commercial systems software being promised but not delivered, because the task was too large to handle in a short time. The almost inexorable laws of combinatorial complexity usually result in large algorithms being exponentially hard to write and to debug.

1.2. Algorithmic complexity

Algorithmic complexity theory is a branch of computing science and mathematics in itself [26][32]. We remind the reader of some of the pitfalls which can arise, using two very simple examples. In the first example, suppose that two algorithms are designed to sort N data points in ascending order on a given computer. The first is normally considered to be the more efficient if it requires O(N log N) operations while the second takes O(N^2) operations. However, if the multiplying factor which is implicit in the O (‘order of’) symbol is very large for the first algorithm, and very small for the second, then the second algorithm may in fact be more efficient for the values of N that matter in practice. To give a second example, a parallel computer with N processors might complete an image processing task on an N-pixel image in O(1) time units, against O(N) time units with a single-processor serial machine. If time is the criterion then the parallel machine is better. If silicon area is the criterion then the serial machine is better. Both will have much the same performance measure if the (chip area ⋅ time units) product is used to assess performance.
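Both examples can be made concrete with a few lines of Python. The sketch below is illustrative only: the constant factors (100 and 1) are made-up values chosen to show the effect, base-2 logarithms are an assumption, and the function names are hypothetical.

```python
import math


def cost_nlogn(n: int, c: float) -> float:
    """Operation count c * N * log2(N) for the 'efficient' sort."""
    return c * n * math.log2(n)


def cost_n2(n: int, c: float) -> float:
    """Operation count c * N^2 for the 'inefficient' sort."""
    return c * n * n


def crossover(c_nlogn: float, c_n2: float, n_max: int = 1_000_000):
    """Smallest N >= 2 at which the N log N algorithm is strictly cheaper.

    Returns None if no such N is found up to n_max.
    """
    for n in range(2, n_max + 1):
        if cost_nlogn(n, c_nlogn) < cost_n2(n, c_n2):
            return n
    return None


def area_time(processors: int, time_units: int) -> int:
    """Area-time product, taking chip area as proportional to processor count."""
    return processors * time_units


if __name__ == "__main__":
    # Hypothetical constants: the N log N algorithm has a 100x larger factor.
    print("N log N wins from N =", crossover(100.0, 1.0))
    n_pixels = 1_000_000
    print("parallel area-time:", area_time(n_pixels, 1))
    print("serial area-time:  ", area_time(1, n_pixels))
```

With these constants the O(N^2) algorithm is cheaper for every N up to roughly a thousand, and the two machines in the second example have identical area-time products.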


Section 2 : Digital computing prospects

In this section we briefly consider some of the many factors which are likely to limit the performance of future machines using conventional digital logic techniques, but perhaps with non-CMOS devices.

2.1. The hardware performance envelope

The fundamental limitations on future hardware, due to such factors as thermal dissipation, velocity of light constraints, signal-to-noise requirements, Heisenberg’s uncertainty principle and so on, are well known [25]. In this section we consider one particular constraint - the limitations on electrical signal propagation speed in conductors such as silicide, aluminium or copper.

2.1.1. Device packing density

First of all, we remind the reader about the prospects for packing enormous numbers of devices on a single semiconductor chip. Let f be the minimum feature size within a device (whether CMOS transistor, resonant tunnelling transistor (RTT) or single electron transistor (SET)) and let

$D = \alpha f = A^{1/2}$    [2.1]

be the characteristic dimension of the device (α being a scale factor, A the device area). For current CMOS technology f is usually the gate length or the interline half-pitch (often denoted by λ), which varies from 250 nm for mainstream commercial technology down to 20-50 nm in experimental devices. The factor α varies from about 100 for out-and-out experimental CMOS or SET devices ([22] [21] [58]) through ~7-10 for CMOS transistors used in general-purpose logic (e.g. [68]), down to values of 3-5 for new CMOS RAM ([38][50][37]). We make the sweeping (but reasonable) assumption that the relation between device area A and minimum feature size f for other devices (in particular, SETs and RTDs) will be similar to that for CMOS, at least within a factor of two or so. Then as device sizes decrease, the number of devices N per chip for a chip of width W will be

$N \sim (W/D)^2$    [2.2]

almost irrespective of the actual devices which are used. Representative values for N are given in Table 1. The reader will recall that the SIA 1997 Road Map hopes that the year 2012 will see CMOS device densities of about 5⋅10^10 devices/cm^2 (1.7⋅10^10 bits) for DRAM, and about 1.8⋅10^8 transistors/cm^2 for general logic. What Table 1 does not consider is whether a given number of nanoelectronic devices per unit area can have the same processing power as the same number of CMOS devices. For example, it has been shown that multiple-input threshold logic gates can be implemented with fewer resonant tunnelling devices than is possible in CMOS, by a factor of 2 - 3 [52]. However, it is not yet known what fraction of chip area will be devoted to particular functions for particular applications, and hence what gains might be achieved by such means. This is one of the tasks of the ANSWERS project.


f (nm)     180      100      70       50       30       10       3        1
α = 3      3⋅10^8   10^9     2⋅10^9   4⋅10^9   10^10    10^11    10^12    10^13
α = 10     3⋅10^7   10^8     2⋅10^8   4⋅10^8   10^9     10^10    10^11    10^12

Table 1. Number of devices per cm^2 chip, as a function of minimum feature size f and scaling factor α. Values of α = 3 - 5 are representative of new CMOS RAM designs; α = 10 - 20 represents general microprocessor logic circuitry. A feature size f = 180 nm represents current top-end commercial practice; values of f = 10 - 50 nm represent, very approximately, the present experimental limits for minimum feature size. A value of f = 1 nm is approximately molecular scale.
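The entries in Table 1 follow directly from equations [2.1] and [2.2] with a chip width of 1 cm. The short Python sketch below recomputes them; the function name `devices_per_chip` is hypothetical, and the 1 cm width is taken from the table caption.

```python
NM_PER_CM = 1e7  # 1 cm = 10^7 nm


def devices_per_chip(f_nm: float, alpha: float, chip_width_cm: float = 1.0) -> float:
    """Estimate the device count N for feature size f (nm) and scale factor alpha."""
    d_nm = alpha * f_nm                 # Eq. [2.1]: D = alpha * f
    w_nm = chip_width_cm * NM_PER_CM    # chip width W in nm
    return (w_nm / d_nm) ** 2           # Eq. [2.2]: N ~ (W / D)^2


if __name__ == "__main__":
    for alpha in (3, 10):
        for f_nm in (180, 100, 70, 50, 30, 10, 3, 1):
            n = devices_per_chip(f_nm, alpha)
            print(f"alpha = {alpha:2d}  f = {f_nm:3d} nm  N ~ {n:.1e}")
```

The printed values agree with the table to the one significant figure it quotes, for example about 3⋅10^8 devices for f = 180 nm and α = 3.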

Another factor that is not considered in Table 1 is how chip defects and real-time (transient) errors will affect the useful number of devices per unit area. Means for overcoming moderate numbers of permanent chip defects exist - for example, by having spare columns on RAM chips [40] or by re-routing round defective blocks on field programmable gate arrays [28]. However, it is almost certain that a much more important factor will be transient errors caused by smaller numbers of electrons being associated with each bit of information, by thermal excitation of stray charge carriers, by background charge variations as carriers are released or trapped with varying time scales [85], by radioactive impurities, or by cosmic rays [82]. Relatively little information is available about such phenomena. However, some preliminary work by Spagocci [60] suggests that, at the smallest device dimensions, transient errors may make it necessary to use multiple redundancy on a massive scale (27-fold or even 81-fold) if conventional digital logic systems are to have comparable failure rates to present-day chips. Some aspects of this problem are discussed in section 2.3.

Yet another factor that is not considered in Table 1 is the possibility of extending semiconductor designs into the third dimension. This can be done by stacking chips in a variety of ways (e.g. [4] [62]) or by trying to deposit more than one layer of active components on a single semiconductor substrate [19]. This report does not consider these two options further.
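The 27-fold and 81-fold redundancy figures mentioned above come from [60], whose model is not reproduced here. As a much simpler illustration of why redundancy reduces the effect of transient errors, the Python sketch below assumes R independent copies of a one-bit result, each wrong with probability p, combined by a perfect majority voter. The error model, the perfect voter and the function name are all assumptions made for illustration, and are not necessarily the scheme analysed in [60].

```python
from math import comb


def majority_failure_probability(p: float, r: int) -> float:
    """Probability that a majority of r independent copies are wrong.

    p is the (hypothetical) per-copy transient error probability;
    r must be odd so that a strict majority always exists.
    """
    if r < 1 or r % 2 == 0:
        raise ValueError("r must be a positive odd integer")
    k_min = r // 2 + 1  # smallest number of wrong copies that outvotes the rest
    return sum(comb(r, k) * p**k * (1 - p) ** (r - k) for k in range(k_min, r + 1))


if __name__ == "__main__":
    p = 0.01  # hypothetical per-copy error probability
    for r in (1, 3, 9, 27, 81):
        print(f"R = {r:2d}  failure probability = {majority_failure_probability(p, r):.3e}")
```

For p = 0.01, three-fold voting already lowers the failure probability to about 3⋅10^-4, and it falls further as R increases. The price is an R-fold reduction in the useful number of devices per unit area, which is the point being made in the text.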

2.1.2. Clock distribution

As device speeds increase, clock frequency limitations will have increasingly severe effects on what architectures are feasible with conventional digital logic, and hence on algorithm performance. It is well known (e.g. [46], [80], SIA 1994 Road Map) that reductions in wire cross-sections will make it impractical to propagate clocked digital signals over long distances. Even if clock or data signals were propagated at the velocity of light, it would only be possible, from a single source, to provide synchronous signals over a circle of 15 mm radius in the plane of a chip for a 1 GHz clock, assuming a maximum acceptable relative skew of 5% of the clock period between centre and edge (cf. e.g. [72] for earlier discussions of this kind of signal limit). With a 10 GHz clock, synchronism can be maintained over a radius of only 1.5 mm. If metallic conductors are used then the effective signal propagation velocity is lower than the velocity of light, and this speed loss becomes progressively worse as the conductor cross-section decreases.

The remainder of this section is devoted to a simplified analysis of how reductions in conductor cross-section are likely to limit the maximum possible frequency at which a digital circuit can operate. The problem can be analysed with different levels of approximation (e.g. [46]). As track cross-sections get smaller, their impedances increase more rapidly than predicted from bulk resistance measurements (e.g. [8]). At even smaller cross-sections (~10 nm), quantum mechanical effects become increasingly significant ([50][51]). However, problems occur long before quantum mechanical effects become important.

The distributed RC transmission line propagation model of [77] shows that a perfect switched digital signal will propagate a distance dprop down a perfectly-terminated wire with square cross-section (width w) and reach 90% of its final value in a time t90 given by

$$d_{prop} \cong \beta\, w\, t_{90}^{1/2} \qquad [2.3]$$

where the proportionality constant β is ~ $1.5 \times 10^{8}\ \mathrm{s}^{-1/2}$ for a square aluminium conductor at a temperature of 50 °C, separated by a distance w/2 from a ground plane by an insulator with a dielectric constant of 4. However, frequency-dependent inductance effects become significant around 1 GHz, and real drivers have a finite rise time. The full analysis is quite complicated (cf. e.g. [24]): for the moment we assume values of β equal to $1.5 \times 10^{8}$ for aluminium, $2.1 \times 10^{8}$ for copper, and $10^{7}$ for silicide [54]. If clock skew problems are to be avoided (see Figs 1, 2) then the time t90 must be a fraction ε of the clock period Tc:

$$d_{prop} \cong \beta\, w\, \varepsilon^{1/2}\, T_c^{1/2} \qquad [2.4]$$

A suitable value for ε would be 0.1 or 0.05. Of course, by using restoring inverters, buffers or driver chains, it would be possible to improve the propagation distance by a significant factor (cf. e.g. [77] or [78], p. 671), but we assume here that this is not possible.
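As a rough numerical illustration of expression 2.4 and of the velocity-of-light limit quoted earlier, the following minimal sketch evaluates both. The function names and the rounded value of c are our own illustrative choices; the β values and ε = 0.05 are those quoted in the text.

```python
import math

BETA_AL = 1.5e8  # s^-1/2, aluminium value quoted in the text
BETA_CU = 2.1e8  # s^-1/2, copper value quoted in the text


def d_prop(width_m, clock_hz, beta=BETA_AL, eps=0.05):
    """Expression 2.4: d_prop ~ beta * w * sqrt(eps * Tc), in metres."""
    t_c = 1.0 / clock_hz
    return beta * width_m * math.sqrt(eps * t_c)


def light_limited_radius(clock_hz, eps=0.05, c=3.0e8):
    """Radius reached at the velocity of light within a skew of eps * Tc."""
    return c * eps / clock_hz


if __name__ == "__main__":
    # Velocity-of-light limit for 1 GHz and 10 GHz clocks.
    for f_clk in (1e9, 10e9):
        print(f_clk, light_limited_radius(f_clk))
    # RC-limited distance for a 10 nm aluminium track at 1 GHz.
    print(d_prop(10e-9, 1e9))
```

For a 1 GHz clock the sketch gives a light-limited radius of 15 mm, and an RC-limited distance of about 10 µm for a 10 nm aluminium track, in line with the figures quoted in this section.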

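[Figure 1: timing diagram of the master clock and of the signals received at points A and B, comparing ideal optical propagation with propagation subject to RC delay. Annotations: maximum propagation distance; clock skew region of width εTc, with εmax ~ 0.05.]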

Fig. 1. This illustrates the well-known problem, that a single-level clock signal can only be propagated over a limited area if signal skew problems are to be avoided.


[Figure 2: log-log plot of maximum propagation distance (mm, 0.0001 to 100) against driving frequency (GHz, 0.1 to 100), with one curve for each track thickness: 10 µm, 1 µm, 100 nm and 10 nm.]

Fig. 2. Finite track cross-sections limit signal propagation distances: this figure uses expression 2.4, with ε = 0.05, to show that, for example, at 300 MHz, tracks of 10 µm thickness are barely sufficient to pass a signal across a 20 mm chip unless additional buffers are used; at 1 GHz, a track of 10 nm thickness must be less than 10 µm long; and for a 10 GHz signal to travel 1 mm, the track must be at least 3 µm thick. These figures are based on the use of an aluminium conductor at 50 °C: copper would provide approximately 40% better performance (see text for details). The use of copper at liquid nitrogen temperature (77 K) would provide approximately 1000% better performance.

Thus the maximum area A which can be covered synchronously from a single driver will be of the order of:

$$A \cong 2\, d_{prop}^{2} = 2\, \beta^{2} w^{2} \varepsilon\, T_c \qquad [2.5]$$

A similar measure, the clock locality metric, has reportedly been developed by Bosshart, though full details are not available [46]. Suppose that the devices (and any surrounding real estate) which receive the clock signal are of characteristic size D = γw , where w is the wire size and γ is an appropriate scale factor. Note that γ will not in general be the same as α, the scale factor which relates the device size D and minimum feature size f. A typical value for γ is 2 - 4 (that is, a ‘typical’ track width will be one-quarter to one-half of the width of a ‘typical’ device). Then the maximum number of devices which can operate synchronously, Ndev , will be:

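$$N_{dev} \cong \frac{A}{D^{2}} = \frac{2 \beta^{2} w^{2} \varepsilon T_c}{\gamma^{2} w^{2}} = \frac{2 \beta^{2} \varepsilon T_c}{\gamma^{2}} = \frac{2 \beta^{2} \varepsilon}{\gamma^{2} F_c} \qquad [2.6]$$

where Fc = 1/Tc is the clock frequency.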
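Expression 2.6 is simple enough to evaluate directly. The sketch below does so, assuming the aluminium value of β, ε = 0.05 and a scale factor γ = 3.2; the function name and default arguments are illustrative choices of ours.

```python
def n_dev(clock_hz, beta=1.5e8, eps=0.05, gamma=3.2):
    """Expression 2.6: N_dev ~ 2 * beta^2 * eps / (gamma^2 * Fc)."""
    return 2.0 * beta ** 2 * eps / (gamma ** 2 * clock_hz)


if __name__ == "__main__":
    # Number of synchronously reachable devices for a range of clock rates.
    for f_clk in (1e7, 1e8, 1e9, 1e10, 1e11):
        print(f"{f_clk:.0e} Hz -> {n_dev(f_clk):.3g} devices")
```

Note that the track width w cancels in expression 2.6, so under these assumptions Ndev depends only on the clock frequency and on β, ε and γ, and falls by a factor of ten for each tenfold increase in clock frequency.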


[Figure 3: log-log plot of the number of devices (1.E+02 to 1.E+08) against clock frequency (0.01 to 100 GHz).]

Fig. 3 This shows how increasing the clock frequency limits Ndev , the maximum number of devices (‘transistors’) which can be reached synchronously by a digital signal which is distributed using Al tracks on a single level, that is, without the use of restoring inverters, clock trees or their equivalent. See text for details of parameters. Using Cu tracks would increase Ndev by about 40%. Reducing the operating temperature to 77K would allow Ndev to increase by about 1000% , because of the improved track conductivity.

Figure 3 plots Ndev against clock frequency for the values β = $1.5 \times 10^{8}\ \mathrm{s}^{-1/2}$, ε = 0.05 and γ = 3.2. By using a multi-level distribution network it is possible to provide high-frequency synchronous clock signals over a much larger area than A in expression 2.5. Current microprocessors use a variety of clocking strategies (e.g. [23], [78]).

For example, a 1.0 GHz, 64-bit integer processor from IBM [59] with 1M transistors in an area of 10 mm², and fdrawn = 0.25 µm, distributes the master clock via a 6 mm long, 51 µm wide spine, from which buffers distribute local clock signals over regions up to ~1.2 by 1.2 mm, containing a maximum of about 200K transistors. The track pitch for the local clocks is 0.9 µm. Details of the clock track thicknesses are not available, but if the track pitch is used for weffective, then expression 2.4 gives dprop ~ 1 mm and expression 2.5 gives A ~ 2 mm² (Aactual ~ 1.4 mm²).

Another microprocessor design from IBM [1] uses four levels for the clock distribution. The use of copper metallisation provides a reduction of about 40% in the RC time constant, in comparison with aluminium. The lowest level metal layer (M1) is used for the last stage of clock distribution: the thickness of this layer is 0.4 µm. With a 480 MHz clock, weff = 0.4 µm and β = $2.1 \times 10^{8}\ \mathrm{s}^{-1/2}$, expression 2.4 predicts that dprop will be ~ 900 µm. The authors state that total wire lengths were restricted to 500 µm. It is worth noting that 500 local clock splitters are used at the lowest distribution level.

The 600 MHz DEC Alpha microprocessor provides a third example of how clocks can be distributed [5]. A master clock signal is generated in one corner of the chip (to reduce interference effects), then fed to the centre of the 17 by 19 mm chip. From the centre the clock is distributed symmetrically, using X and H trees, to the four quarters of the chip. Inside each quadrant is a fine rectangular grid of connectors (‘GCLK’), which is driven from all four sides by four RC trees. Further down the six levels of the clock hierarchy, subgrids are also used. The use of a grid has disadvantages - for example, increased capacitance and hence a poorer frequency response - but it also has advantages, such as universal availability of clock signals. The frequency response of such grids cannot be approximated by the simple dprop model.


2.1.3. Signal distribution

Although clock signals can, with some effort, be distributed with effectively zero delay over relatively large areas, this is not possible with data signals. Even with existing commercial chips, with relatively modest clock frequencies of less than 1 GHz, there are certain critical data paths where signal propagation delays are starting to hinder device performance. These critical paths are typically associated with register-to-register arithmetic and data transfer from on-chip cache memory [11], and are usually no more than 1 - 1.5 mm long, with track widths around 1 µm, which is consistent with expression 2.4. The path length could be allowed to exceed dprop by a small factor, provided that all of the other operations - for example a floating-point multiply and a conditional test - can be implemented in a fraction of the clock cycle before the data are returned to a register. If this is not possible, then synchronisation requirements demand that some sort of clocked latch be inserted in the data path, to re-synchronise the signal. This immediately halves the effective data transfer rate. Data transmission over longer distances would have even lower effective rates.

There are, in principle, ways of ameliorating these speed losses. Inserting restoring inverters in the signal paths will reduce the effects of RC delays (e.g. [78], p. 670), but there may not always be sufficient space to do this. Pipelining is a favourite suggestion for overcoming some of the effects of data propagation delays (e.g. [7] [52] [83] [87]). However, this cannot be used on the data-critical paths referred to above. Asynchronous operation is another theoretical option in some situations, but this introduces major problems of its own (e.g. [79]) and would not remove the delays in, for example, register-to-register arithmetic. Data or instruction parallelism can be spectacularly effective for some problems, but requires replication of processors, and we are considering here the structure of a single processor. One possible single-processor speedup option is to reduce the track resistance by cooling, which could give an improvement in dprop by a factor of 10 or more. This would have obvious limitations (in the form of an attached cooling engine) which it would clearly be desirable to avoid.

What effects will delays in signal propagation have on ‘conventional’ microprocessor designs, as the devices (whether CMOS, RTDs, or other) get smaller? For example, suppose that we (temporarily) ignore all questions of circuit reliability, fault tolerance, and system redesign, and consider only size effects. What, for instance, would be the maximum clock frequency of a 32-bit chip with 10 nm minimum feature size? Or a 128-bit processor with 20 nm feature size? It is the critical data paths that provide the main limit on clock frequency in ‘conventional’ designs. For the purposes of illustration we consider a hypothetical integer addition unit and cache memory, with data being transferred from the cache to the adder, and the answer returned to the cache - see Fig 4.

[Figure 4: three panels, each showing a memory block and an adder block connected by N-bit word data paths. A) Two N-bit numbers collected from memory block and interleaved into adder block. B) Addition process. C) Result stored in memory block.]

Fig. 4 Data paths involved in transferring two data words from memory, adding them, and returning them to memory.

Consider a carry lookahead adder (CLA) [30], which adds two words of length Lword bits. There are $(1+\log_2 L_{word})$ processing layers in the adder. The first layer consists of Lword full adders. Each full adder has an effective processing depth of two logic gates, with the signal path in each gate typically passing through two devices (depending on the design). If tdev is the average device delay then the first-layer signal delay will be approximately 4 tdev. The delays in the second, third, …, $(1+\log_2 L_{word})$-th layers are approximately the same. In addition to these delays, the signals have to travel between devices in different layers. Because of the finite width D of every device, the track lengths at each stage are approximately 3D, 6D, 12D, …, 3(Lword/2)D, with the factor of three being due to the number of inputs to each first-stage full adder. From expression 2.4 the propagation time tprop for a track of length l and width w is $t_{prop} = l^{2}/(w^{2}\beta^{2}\varepsilon)$, with β ~ $1.5 \times 10^{8}\ \mathrm{s}^{-1/2}$ for Al ($10^{7}$ for silicide) and ε having the value 0.05 – 0.1: here we shall set ε = 0.1. The worst-case time for a signal to enter the top of the adder and reach the lowest stage is the sum of the ‘down’ and ‘across’ times:

$$t = 4 t_{dev}\,(1 + \log_2 L_w) + 9\left(1 + 2^{2} + 4^{2} + \dots + (L_w/2)^{2}\right)\frac{D^{2}}{\beta^{2} w^{2} \varepsilon} \cong 4 t_{dev}\,(1 + \log_2 L_w) + \frac{3 D^{2} L_w^{2}}{\beta^{2} w^{2} \varepsilon} \qquad [2.7]$$
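The approximation in expression 2.7 replaces the geometric sum 9(1 + 2² + 4² + … + (Lw/2)²) by 3Lw². A short sketch, with hypothetical function names, confirms how close the two are for the word lengths of interest:

```python
def across_factor_exact(l_word):
    """9 * (1 + 2^2 + 4^2 + ... + (l_word/2)^2), for l_word a power of two."""
    total, step = 0, 1
    while step <= l_word // 2:
        total += step * step
        step *= 2
    return 9 * total


def across_factor_approx(l_word):
    """The closed-form approximation 3 * l_word^2 used in expression 2.7."""
    return 3 * l_word ** 2


if __name__ == "__main__":
    for l_word in (32, 64, 128):
        print(l_word, across_factor_exact(l_word), across_factor_approx(l_word))
```

The exact sum equals 3(Lw² − 1), so the approximation overestimates the ‘across’ term by only 3 units in several thousand.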

The time tdev for a signal to cross a device (a ‘transistor’) is a very complicated function of the device parameters and operating conditions. It is possible to devise estimates of tdev for CMOS (e.g. [54]), for RTDs and for various other devices. Here we provide a greatly simplified estimate of tdev for CMOS, based on the scaling analysis given in [54]. This can then be used as a benchmark for comparing the performance of other devices. We approximate tdev by:

$$t_{dev} = (t_{rise} + t_{fall})/2; \qquad t_{rise} \sim \frac{4 C_L}{\beta_n V}; \qquad t_{fall} \sim \frac{4 C_L}{\beta_p V} \qquad [2.8]$$

where CL is the load capacitance, V is the operating voltage = $3 \times 10^{6} f$ (here we set V(f = 1 µm) = 3 V), and

$$\beta_n = \frac{\mu_n\, \varepsilon_{diel}\, \varepsilon_0}{t_{ox}} \cdot \frac{W_n}{L_n} \qquad [2.9]$$

with a similar expression for βp. Although the carrier mobilities µn and µp differ from each other by a factor of four, and are functions of the gate length L, for simplicity we ignore this and replace them both by a single constant µ = 0.1 m²/Vs. We set the dielectric constant εdiel = 4, and the permittivity of free space ε0 is approximately $9 \times 10^{-12}$ F/m. The ratio of the electrode width to length (W/L) is set to 3. The oxide thickness tox is (approximately) proportional to f, the minimum feature size: here we set the constant of proportionality η to 0.05. Since tox = ηf, β varies inversely with f:

$$\beta \sim 1.4 \times 10^{-10} / f \qquad [2.10]$$

We set

$$C_L \sim \frac{\varepsilon_{diel}\, \varepsilon_0\, A_{gate}}{t_{ox}} = \frac{\varepsilon_{diel}\, \varepsilon_0\, D^{2}}{4 \eta f} = \frac{\varepsilon_{diel}\, \varepsilon_0\, 24 f^{2}}{4 \eta f} = 4.3 \times 10^{-9} f \qquad [2.11]$$

assuming that Agate ~ D²/4 and that D², the device area, equals 24f². Hence we obtain a (very) approximate estimate for the device delay tdev:

$$t_{dev} \sim 4 \times 10^{-5} f \qquad [2.12]$$
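The chain from expression 2.8 to expression 2.12 can be checked numerically. The sketch below is illustrative only: it reads expression 2.10 as β ≈ 1.4 × 10⁻¹⁰ / f (the inverse dependence on f that follows from tox = ηf, and the form that reproduces expression 2.12), takes βn = βp, and uses function and variable names of our own.

```python
def t_dev(f_m):
    """Simplified CMOS device delay (s) for minimum feature size f_m (m)."""
    c_load = 4.3e-9 * f_m   # expression 2.11, load capacitance in farads
    beta = 1.4e-10 / f_m    # expression 2.10, read as inversely proportional to f (assumption)
    v_op = 3.0e6 * f_m      # operating voltage, 3 V at f = 1 micron
    t_rise = 4.0 * c_load / (beta * v_op)  # expression 2.8
    t_fall = t_rise                         # beta_n = beta_p in this simplification
    return (t_rise + t_fall) / 2.0


if __name__ == "__main__":
    for f_m in (1e-6, 0.25e-6, 50e-9):
        print(f"f = {f_m:.2e} m -> t_dev = {t_dev(f_m):.3g} s")
```

Under these assumptions tdev comes out at about 4 × 10⁻⁵ f, i.e. roughly 10 ps at f = 0.25 µm.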

Since the signals in a CLA effectively traverse the unit twice, the shortest time tminadd in which such an adder can add two words of Lword bits is therefore:

$$t_{minadd} \sim 800 f^{2} (1 + \log_2 L_w) + \frac{144 f^{2} L_w^{2}}{\beta^{2} w^{2} \varepsilon} = 1.6 \times 10^{-4} f\,(1 + \log_2 L_w) + 1.6 \times 10^{-14} L_w^{2} \qquad [2.13]$$

assuming metal (Al) tracks of width w = 2f. Note that, although the calculations have been based on the assumption of a CLA unit, the conclusions are not expected to alter very much if some other adder type is assumed instead. Detailed calculations will be carried out in the near future.


Having obtained a crude estimate of the minimum possible delay for the adder, we now consider the delays involved in accessing the cache memory. This is a complicated subject, and so for the present we base the following illustrative calculation on the classic four-quadrant memory structure (e.g. [47], [80], [86]). We assume that the memory consists of a square array of $L_{word}^{1/2}$ by $L_{word}^{1/2}$ units, with each unit being a four-quadrant memory containing Nword bits of storage. The time to retrieve one bit of data from one of these units will be the sum of the times for address line decode, word line enable and bit line propagation:

$$t_{submem} \sim t_{dev}\, e \ln(N^{1/2}/2) + t_{dev}\,(N^{1/2}/2) + t_{loaded\_line} \qquad [2.14]$$

where tdev is the transit time for one device (‘transistor’) and e = 2.7183…. The time delay of the loaded line will be, very approximately,

$$t_{loaded\_line} \sim R_{track}\,(C_{track} + C_{loads}) \qquad [2.15]$$

The track resistance Rtrack = ρ l / w², where l, the line length, equals $(N_{word}^{1/2}/2)\,w_{bit}$ (since the line length is equal to the width of a single bit-cell, multiplied by $N_{word}^{1/2}/2$). We set the width of a memory cell, wbit, to 6f. The resistivity ρ ~ $2 \times 10^{-6}$ Ω m for silicide. We assume that w = 3f and that the track capacitance Ctrack is approximately $2 \varepsilon_{diel} \varepsilon_0 l = 4.3 \times 10^{-10} N^{1/2} f$. Since $N^{1/2}/2$ devices load the line, the capacitative loads will be:

$$C_{loads} \sim (N_{word}^{1/2}/2)\, C_{gate} = 4.3 \times 10^{-9} f\, (N_{word}^{1/2}/2) = 2.1 \times 10^{-9} N_{word}^{1/2} f \qquad [2.16]$$

Thus

$$t_{loaded\_line} \sim 0.73 \times 10^{-15} N_{word} + 2.8 \times 10^{-15} N_{word} \qquad [2.17]$$

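with the first term being the delay due to the unloaded track, the second being the delay due to the capacitative loads of the cells. Thus, using $e \ln(N^{1/2}/2) \approx 0.95 \log_2 N - 1.9$:

$$t_{submem} \sim 4 \times 10^{-5} f \left(0.95 \log_2 N_{word} + N_{word}^{1/2}/2 - 1.9\right) + 3.5 \times 10^{-15} N_{word} \qquad [2.18]$$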

This is the time for one bit of data to be accessed from one sub-memory. We make the further assumption that the worst-case time for this data to be passed to the adder involves an extra delay of $L_{word}^{1/2}$ multiplied by the unloaded line time ($0.73 \times 10^{-15} N_{word}$), if we assume that the units are arranged in a square, with signal conditioners spaced alongside each sub-memory. Since two words have to be extracted from the memory, and one word has to be returned to the memory after addition, the total time for memory access/storage will therefore be approximately:

$$t_{minmem} = 3\,(\text{time to extract data from submemory} + \text{time to send to the adder}) + \text{addition time}$$

$$= 3\left[\,4 \times 10^{-5} f \left(0.95 \log_2 N_{word} + N_{word}^{1/2}/2 - 1.9\right) + 3.5 \times 10^{-15} N_{word} + 0.73 \times 10^{-15} L_{word}^{1/2} N_{word}\right] + 1.6 \times 10^{-4} f\,(1 + \log_2 L_{word}) + 1.6 \times 10^{-14} L_{word}^{2} \qquad [2.19]$$

where f = minimum feature size, Lword = word length in bits (32, 64, … 128), and Nword = number of words in cache memory. The maximum clock frequency Fmax is then 1/ttotal. Figure 5 illustrates this relationship.
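Expression 2.19 can be evaluated directly to explore how the read-add-write rate varies with feature size, word length and memory size. The following is a minimal sketch using the numerical coefficients of expression 2.19 as printed; the function names and the conversion to Gops are our own illustrative choices.

```python
import math


def t_submem(f_m, n_words):
    """Sub-memory access time (s): the bracketed device and loaded-line terms."""
    device_term = 4e-5 * f_m * (
        0.95 * math.log2(n_words) + math.sqrt(n_words) / 2.0 - 1.9
    )
    return device_term + 3.5e-15 * n_words


def t_add(f_m, l_word):
    """Adder time (s), the last two terms of expression 2.19."""
    return 1.6e-4 * f_m * (1.0 + math.log2(l_word)) + 1.6e-14 * l_word ** 2


def t_min_mem(f_m, l_word, n_words):
    """Expression 2.19: time (s) to read two words, add them and store the sum."""
    transfer = 0.73e-15 * math.sqrt(l_word) * n_words
    return 3.0 * (t_submem(f_m, n_words) + transfer) + t_add(f_m, l_word)


def gops(t_seconds):
    """Operations per second, in units of 10^9."""
    return 1e-9 / t_seconds


if __name__ == "__main__":
    for f_m in (0.25e-6, 50e-9):
        for l_word in (32, 64, 128):
            rate = gops(t_min_mem(f_m, l_word, 1000))
            print(f"f = {f_m:.2e} m, {l_word}-bit, 1000 words: {rate:.2f} Gops")
```

With f = 0.25 µm, a 32-bit word and 1000 words, this sketch gives a rate close to 1 Gops. Because the sketch evaluates only the printed formula, its values for other parameter choices should be treated as indicative rather than as a reproduction of Figure 5.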

[Figure 5: log-log plot of maximum operation speed (Gops, 0.1 to 10) against number of words in memory (1 to 100000), with curves for 32-bit, 64-bit and 128-bit word lengths at feature sizes of 0.05 µm and 0.25 µm.]

Fig. 5. This plot illustrates how the extraction of two 32, 64, or 128-bit words from a single-stage cache memory, their addition and the subsequent return of the answer to memory, is limited by expression 2.19, which approximately defines the relation between the word length (in bits), the number of words stored in memory, and the maximum number of operations/second at which the process can be run. A minimum feature size of 0.25 µm is representative of current microprocessor designs; a value of 0.05 µm (i.e. 50 nm) is close to the limit of technology projections (and also represents the limits of the validity of the theory). Note that the clock frequency will have to be 2 – 4 times higher than the number of operations/second, depending on the number of words and the word length. See text for details.

Note that expression 2.19 does not specify the minimum clock period (or maximum clock rate): it specifies the time to carry out the particular operation of reading two words from memory, adding them, and then returning the answer to memory. The maximum clock rate will have to be higher than this by a factor of two to four, depending on the word length and the number of words in memory. Expression 2.19 provides an optimistic prediction (perhaps by a factor of two) for the maximum allowable number of read-add-write operations per second for a ‘conventional’ CMOS microprocessor with a word length of Lword bits and a level 1 cache memory of Nwords words. Its predictions for small feature sizes will definitely be too optimistic, because it uses a very simple scaling model. The scaling factors are known to deteriorate as the feature size diminishes (cf. e.g. [89]). The analysis can of course be improved in a variety of other ways. These include: more specific models of device (‘transistor’) operation, and better models of other factors such as driver/track impedance mismatch, memory layout, adder (or FPU) design and so on.


Much more important, however, is that the analysis which led to expression 2.19 can be modified to apply to other devices, in particular RTDs. A similar analysis will be made for the multiplication and division of two numbers. Although it seems unlikely that SETs will be usable in binary logic devices (because of buried charge fluctuations), some time will be spent in looking at how SETs might be used to implement digital logic memories, adders and multipliers. This would complement the work on QCA binary logic which has been carried out at UCL ([74], [75], [76]).

It is also obvious that expression 2.19 cannot apply down to indefinitely small track cross sections. When does it start to fail? The answer is: when quantum effects affect the resistivity. These effects appear when the effective track width falls below about 0.4 to 0.5 µm [8]. For example, a wire of 0.25 µm width has an effective resistivity which is twice that of a wire of 0.6 µm width. This would reduce the maximum clock frequency below the value given by expression 2.15. It has been suggested that ballistic electron conduction (i.e. conduction without scattering from phonons, surface defects etc.) might reduce the effective resistivity [24]. At smaller wire sizes, less than about 10 nm, other quantum effects need consideration [49][64].

Although expression 2.19 is (deliberately) simplified, a number of conclusions can be drawn immediately from it and from Figure 5. For a minimum feature size of 0.25 µm (current commercial devices have f = 0.35 or 0.25 µm), Figure 5 shows that a 32-bit adder with a 32 kbit memory (i.e. 1000 words) could run at a maximum rate of approximately 1 Gops, with a clock rate of 2 or 3 GHz depending on the design. Real systems are a factor of two slower than this, because expression 2.19 is only approximate (and uses an optimistic value for ε, the skew fraction in expression 2.4). The speed decreases only slightly, from 1.1 to 0.9 Gops, as the word length increases from 32 to 128 bits, for a 1 Kword memory. However, with a minimum feature size of 50 nm, the relative fall-off with word size is much larger: a 32-bit adder with a 1 Kword memory would operate at perhaps 3 Gops (with a clock speed of 6 – 9 GHz), but a 128-bit adder would only run at 1.5 Gops (clock speed ~ 3 – 5 GHz). Operations requiring long word lengths, such as high-precision floating point arithmetic, would therefore suffer.

In particular, for memory sizes of less than 1 Kwords, the processing speed reaches a plateau for long word lengths, almost independent of the memory size. This is because almost all of the time is taken up by the addition operation. We may interpret this plateau effect to imply that it will be difficult with CMOS, if not impossible, to extract two 128-bit numbers from a register or latch, add them and return them to the register, at a rate of more than about 2 Gops, with a clock speed of 4 - 6 GHz (very approximately). The limits for 64-bit and 32-bit numbers would be, very approximately, 10 Gops (~25 GHz) and 20 Gops (~50 GHz) respectively. Given that CMOS devices can work with gate delays of 10 – 20 picoseconds, this is a potentially serious constraint. However, the curves in Figure 5 relate only to CMOS. Since RTDs can already operate up to 66 GHz [70] and higher, the ultimate performance limits for RTDs will be much higher. Equivalent calculations for RTDs will be carried out in the near future.

The two other important operations are, of course, multiplication and division. The calculation which led to expression 2.19 was based on the use of a CLA adder: a similar calculation based on integer and floating point multipliers/dividers has not yet been carried out. It will also be necessary to examine the possibility of improvement for most of the critical paths. In principle this can be done using a succession of increasingly powerful drivers and thicker tracks for long-distance connections (cf. e.g. [47]), but this may not be practical for many of the tracks considered here.

2.2. Algorithms for digital machines

Given the speed constraints implied by expression 2.19, it is necessary to ask how more complicated algorithms than simple addition or multiplication might be affected. For example, how much would an equation solver be slowed up? Would it be impossible to speed up a big fast Fourier transform (say 1024 by 1024), even if one could shoehorn all of the components onto a single chip? Would a miniaturised, portable, voice and face recognition system be impossible to implement, even with $10^{12}$ devices on a single low-power chip?

These questions can be answered in different ways. First, it should be understood that no existing algorithms or problems need run slower on future machines of conventional design, merely that there will come a time when increasing the clock speed will not speed up their solution. Thus the speed limit implied by expression 2.19 does indeed seem to be a fundamental limit (if one assumes that cooling circuits to 77 K is undesirable). Thus, for example, it will never be possible to run the FFT, which is a mixed serial-parallel algorithm, beyond some maximum clock rate, which will depend on the number of data points to be transformed. This will be true no matter how small the computing elements, or how many, or how fast the clock, if one assumes the use of ‘conventional’ clocked digital logic circuits based on planar semiconductor technologies.

On the other hand, the limits on an equation solver will depend on the structure of the particular equation, and of the particular algorithm which is used to solve the equation. Some equations are amenable to being decomposed into more-or-less independent components. Provided that they are mapped onto a suitable architecture, using a suitable algorithm, then speedups are possible. The investigation of the computational complexity of algorithms, in time and space, is a well-trodden field, and future ANSWERS reports will provide more detailed descriptions of the classes of problem which are subject to clock speed limitations and which are not. For the moment we remind the reader that, in terms of computer architecture, the algorithms which will be most amenable to speedup are those where independent small-scale (‘fine-grained’) calculations are possible, with relatively little data transfer from one computing cell to the next.

2.3. Fault tolerance in digital logic circuits

It is important to distinguish between static and dynamic fault tolerance. Existing CMOS-based computer chips, as they emerge from the manufacturer’s works, have undergone a set of tests which are designed to eliminate chips which have any permanent defects which prevent the chip from working. Such defects may arise from wafer defects or from faults in the multilayer deposition processes. For memory circuits, where the packing density is higher than for general-purpose logic, redundant components are used, which can be switched in to replace permanently faulty devices (e.g. [40], [86]). On-chip redundancy is little used in mainstream microprocessor chips, but field-programmable gate arrays offer the chance of providing on-chip redundancy for general-purpose logic (e.g. [20], [66], [69] [84]). The extreme example of how this potential redundancy might be used is the Teramac reconfigurable computer, which is designed to work with only 7% of its FPGAs, 4% of its connections and 7% of its circuit-board pathways working [28]. This is an impressive system, but the reader should be aware that the ‘Tera’ in the machine’s name refers to bit operations per second. With only 7% of its 512 chips operational, the Teramac would presumably only be capable of about 200 Mflops, roughly comparable with a single medium-sized workstation. We make this point, not to criticise the Teramac, but to remind the reader that FPGAs, because of their general-purpose architectures, are relatively inefficient in comparison with dedicated microprocessor designs.

Reconfigurability is extremely useful, but is only capable of correcting for static faults, i.e. permanent defects in the circuit. Dynamic faults need completely different solutions. In existing devices they can arise in a variety of ways, for example by the passage of ionising particles, from radioactive materials in the chip, by marginal components causing occasional errors, by electrical interference and by fluctuations in signal levels. The probability of failure in one clock cycle for any one device may be very small, but the probability of failure over, say, a year’s worth of clock cycles for each of $10^{7}$ (or $10^{8}$, $10^{9}$, … $10^{12}$) devices can rapidly produce unacceptably poor system reliability. Existing memory devices use parity checks, or in some cases one-bit error correcting codes, to achieve acceptably low dynamic fault rates. However, it is certain that dynamic error rates will increase as device sizes get smaller and device numbers get larger. Multiple-bit error correcting codes will provide a modest degree of protection against soft memory errors, but this approach requires ever-increasing word lengths.

Protection against dynamic errors in digital logic circuits can take three main forms: to use some error-correcting codes; to repeat the calculation three, five, … times and to use majority voting among the answers; or to replicate the hardware three, five, … times and to use majority voting. The first approach is really only effective in reducing transient faults in memory, and is not suitable for improving the performance of arithmetic units. The second option reduces the effective clock rate by a factor of 3, 5, …; the third approach reduces the effective number of devices by a factor of 3, 5, …. Since the dynamic error rate for a device mainly depends on the physics of the device operation, it is necessary to have a good physical model of the device characteristics before dynamic error rates can be estimated and fault-tolerant techniques devised. This is one of the aims of the ANSWERS project.
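The arithmetic behind these remarks can be sketched as follows. This is an illustrative calculation only, not a model of any particular device: it assumes statistically independent failures, a perfect voter, and a hypothetical per-device, per-cycle failure probability; the function names are ours.

```python
import math


def system_failure_prob(p_cycle, n_devices, n_cycles):
    """Probability that at least one of n_devices fails at least once in
    n_cycles, i.e. 1 - (1 - p)^(n_devices * n_cycles), for independent failures."""
    # log1p/expm1 keep precision when p_cycle is extremely small.
    return -math.expm1(n_devices * n_cycles * math.log1p(-p_cycle))


def majority_failure_prob(p_unit, copies=3):
    """Probability that a majority of `copies` independent units are wrong,
    so that a perfect majority voter returns the wrong answer."""
    need = copies // 2 + 1
    return sum(
        math.comb(copies, k) * p_unit ** k * (1.0 - p_unit) ** (copies - k)
        for k in range(need, copies + 1)
    )


if __name__ == "__main__":
    cycles_per_year = 3.15e16  # assumed: one year at a 1 GHz clock
    p = 1e-25                  # hypothetical per-device, per-cycle failure probability
    for n in (1e7, 1e9, 1e12):
        print(f"{n:.0e} devices: {system_failure_prob(p, n, cycles_per_year):.3g}")
    for copies in (3, 5):
        print(copies, majority_failure_prob(0.01, copies))
```

Even with a per-cycle failure probability as small as the hypothetical 10⁻²⁵ used here, the chance of at least one failure in a year grows from a few percent for 10⁷ devices to near certainty for 10¹² devices, which is the scaling problem described above. Majority voting over three copies reduces a unit error probability p to roughly 3p² when p is small.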

Section 3 : Analogue/probabilistic computing prospects The overwhelming majority of machines today use only digital logic, but before the early 1960s many computers were analogue in nature. There was a continuing interest in analogue machines of various kinds during the 1970s and 1980s, although these were usually specialised devices. The resurgence of interest in neural networks in the early eighties resulted in many analogue/probabilistic systems being simulated on digital computers, a method that is known to be inefficient but which has the great advantage of programming flexibility. The algorithms which are explicitly or implicitly associated with these analogue structures or networks are relatively simple to write down, but often hard to define in terms of their performance, for any but the simplest network. This distinguishes them from many digital logic algorithms, which are sometimes amenable to rigorous mathematical or statistical analysis (although there are of course many digital algorithms which are not simple to analyse - see eg [26][32]). The dichotomy arises, of course, from the so-called ‘emergent complexity’ of closely coupled assemblies of nonlinear analogue components. It is somewhat ironic that a different type of analogue system, namely the quantum mechanical interaction of electrons and atoms in nanoscale circuits, is astonishingly difficult to simulate on any sort of digital computer [61]. It is, however, a class of analogue system that is central to the MELARI programme). One particular ‘analogue computing’ line which has continued for many years, but always at a low level, is the direct implementation of neural network elements in silicon, mainly but not always in CMOS. 
Many of these devices used ‘tricks’ based on the peculiarities of CMOS electronics; most of them never progressed beyond the experimental prototype stage; nearly all of them were research-level implementations of extremely simple algorithms, which would be of little use in solving real-world problems. There were some notable exceptions, for example the devices developed by Carver Mead and colleagues, which represented a significant improvement in algorithmic complexity and usefulness over earlier efforts.

Another development line was that of quasi-analogue devices such as ‘artificial retinas’, where (conventional) on-chip processing is associated with each pixel in a CCD or CMOS sensor. This approach provides a halfway house between the high speed and packing density (but algorithmic inflexibility) of completely analogue circuitry, and the algorithmic flexibility (but relatively low performance) associated with general-purpose algorithms written for general-purpose digital computers.

Although cellular automata are usually considered to be digital in nature, they have much in common with some types of neural networks. For the purposes of this section they will be considered to be ‘analogue’ systems (but see section 4 on the possible applications of QCAs).

Neural networks can be used to implement arbitrary logic functions, because they are made up of ‘neuron’ processing elements. These elements can be used, either individually or in small groups, to implement arbitrary Boolean functions. It is therefore possible in principle to use a neural network as a universal computer. This possibility, combined with the known ability of neural networks to be tolerant to transient defects and static faults (at least in some circumstances), has sometimes been used as an argument for the use of neural networks as the ‘computers of the future’.

This argument has some merit. One could for example cite de Garis’s ‘CAM Brain Machine’ [41] as an example of how large numbers of simple cellular automaton ‘neural’ elements can be connected and ‘trained’, using genetic programming techniques, to carry out complex tasks. As an example of fault tolerance, one could cite many papers where the random disconnection of neurons or their connections can be shown to have relatively little effect on the network performance.

This argument also has some severe deficiencies. It is indeed possible to make an array of regularly-connected cellular automata, with each cell implementing the same logical function on its inputs, and to arrange that each cell works in such a way that the whole array emulates the behaviour of a conventional digital computer. The classic example of this is when, in 1977, students at MIT used Conway’s cellular automaton game of ‘Life’, with its glider guns, traffic lights and other exotic creatures, to simulate parts of a digital computer (e.g. [88]; see also [73]). Unfortunately, it took approximately $10^4$ cells to duplicate an OR gate. Although ingenious and amusing, this example illustrates that using cellular automata with fixed rules to emulate an arbitrary logic function is extremely inefficient. Another example, which should be familiar to the reader, is that the human brain, with perhaps $10^{14}$ synapses and $10^9$ neurons of a hundred different types, is barely capable of multiplying two five-digit numbers together without the aid of auxiliary cache memory in the form of a sheet of paper (admittedly, animal brains are not exactly describable as regularly-connected arrays of cellular automata).
The conclusion one should draw is that the architecture should match the algorithm. If a particular type of processing element, assembled in a particular architecture, is unsuitable for solving an arbitrary problem, is there a way to reconfigure it, that is, to train it to solve the problem better? The answer is partly ‘yes’. One can use a range of training methods to improve the performance of neural networks such as multilayer perceptrons (MLPs). Alternatively, one can ‘breed’ for improved performance by having multiple copies of a system, then using genetic algorithms to select the best-performing systems over a sequence of generations. This is the approach which is being taken with the cellular automaton CAM system [41], which is being developed with the aim of mimicking the behaviour of a small animal (according to the promotional literature, a kitten). Whether this approach will be successful remains to be seen, but it is not clear to the writer whether it would be particularly successful in designing (for example) an efficient 32-bit ALU with instruction look-ahead, since it has long been known that existing neural network designs are spectacularly poor at storing representations of binary digits. This applies with even more force if fault tolerance is needed.

One of the many reasons which have been put forward as arguments for the use of neural networks, whether analogue or digital, has been that they are potentially capable of fault tolerant operation. The idea is that by spreading the information to be stored over many ‘neuron’ processing elements, the loss of a few neurons will only produce a modest reduction in the signal-to-noise ratio of a retrieved pattern or classifier output. With certain restrictions, this can indeed be true. There is no direct comparison with conventional digital fault-tolerant techniques, but a very crude example may illustrate the problem.
Suppose that a neural network has to store information about a single number, say between 0 and 15, by ‘spreading’ it over five neurons, each capable of storing numbers between 0 and 3, perhaps using a ‘thermometer’ representation for number storage (i.e. using linear coding instead of binary coding). Adding the signals from the five neurons would restore the signal, and losing one of the neurons would only cause the reconstituted signal to be 20% in error. The equivalent digital logic circuit might use fivefold modular redundancy, where the number is stored in five different memory locations, whose outputs are all fed to a majority logic gate. The neural network approach is apparently more economical (5 times 2 = 10 bits, plus adder), but cannot be guaranteed to return an exact answer. The digital approach requires more storage (5 times 4 = 20 bits, plus comparator), but it is (almost) 100% reliable. Increasing the retrieval accuracy of the neural network memory is computationally very expensive.
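The crude comparison above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the value is shared as evenly as possible over the five neurons (the text does not say how the spreading is done), and a lost neuron is modelled as contributing zero to the sum.

```python
from collections import Counter

NUM_NEURONS = 5   # five neurons share one stored number
NEURON_MAX = 3    # each neuron holds a level between 0 and 3 (2 bits)


def spread(value):
    """Share an integer 0..15 as evenly as possible over the five neurons."""
    if not 0 <= value <= NUM_NEURONS * NEURON_MAX:
        raise ValueError("value must lie between 0 and 15")
    base, extra = divmod(value, NUM_NEURONS)
    return [base + 1 if i < extra else base for i in range(NUM_NEURONS)]


def recall_spread(levels, lost=None):
    """Add the neuron outputs; a lost neuron (assumed silent) adds nothing."""
    return sum(level for i, level in enumerate(levels) if i != lost)


def recall_majority(copies):
    """Five-fold modular redundancy: return the most common stored copy."""
    return Counter(copies).most_common(1)[0][0]


if __name__ == "__main__":
    value = 15
    levels = spread(value)                  # [3, 3, 3, 3, 3]
    print(recall_spread(levels))            # 15: intact network
    print(recall_spread(levels, lost=0))    # 12: one neuron lost, 20% low

    copies = [value] * 5                    # five 4-bit memory locations
    copies[0] = 0                           # one location corrupted
    print(recall_majority(copies))          # 15: majority vote is exact
```

The spread store degrades gracefully but inexactly, while the majority vote returns the exact value as long as most of the copies survive.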


3.1. Algorithms for analogue machines

There are many different types of neural networks, and even more different types of neural network algorithms. However, to simplify things drastically, one may classify neural network types into ‘feedforward’ and ‘feedback’. The former type includes most multilayer perceptrons (MLPs), while the latter includes Hopfield nets and self-organising maps (SOMs). The last are usually two-dimensional grids of cells with variable-size neighbourhoods, but can be considered to be feedback networks (see Figures 6 and 7 in Section 4). For descriptions of these and other networks see e.g. [27], [39], [55].

In the present document we will focus on only two aspects of neural networks: first, their need in general to have long-distance connections, and second, the need to ‘train’ them, that is, to program them.

The problems associated with long-distance connections have already been emphasised in section 2, and in general, neural network structures face even greater communication difficulties than conventional digital processors do, because they need more wires. This connectivity problem will be revisited in a later ANSWERS document. Although in theory many neural network algorithms do not need clocked logic, their direct implementation in hardware would sometimes be easier if clocking were to be used. Even without clocking, extended neural networks face almost exactly the same problems as clocked digital circuits: data have to be passed over long distances, and the system performance will be limited by signal propagation delays.

Training or programming is an equally difficult problem. When neural network programs are run on digital computers there is no problem in updating simulated weights (synapses), but adjustable-weight hardware is very difficult to implement in CMOS, and is likely to be equally difficult to implement with any other device. The most successful neural networks avoid this problem by having digital weights (or indeed digital neurons).
For the moment, we will assume that any ‘analogue’ data in some nanoelectronic neural network will be stored in digital form. Because dynamic faults are likely to be significant, we will assume that such data will be stored in the linear ‘thermometer’ form and not in a binary representation. Several systems have recently been proposed, which use evolutionary or genetic algorithms to ‘train’ neural networks (e.g. [13], [16], [41]). It should be pointed out that such algorithms, in which an optimal solution to some problem is found by ‘breeding’ for optimality, would represent a particular challenge for implementation in neural network or analogue hardware. Although the use of FPGAs appears to offer a convenient way to reconfigure connections or weights during training or ‘evolution’ (see e.g. [42]), it was pointed out earlier that FPGA structures are relatively inefficient in terms of packing density. Given also that very small devices are likely to have poor dynamic error statistics, it seems likely that FPGAs will not be a useful option in the long term. One should also note that, although genetic algorithms are very powerful within their own range of applications [17], outside this range they exhibit quite rapid fall-off in efficiency [57]. Once trained, neural networks are often hard to interpret, in the sense that the patterns of weights and connections which they have developed during training are usually not simple mappings of what they have learned to recognise. To give an extreme example, it is possible for researchers to recognise, at a very coarse level of spatial resolution, the components of the human brain which are involved in recognising one’s grandmother: the details of which neurons contain the necessary information, and just what that information is, are quite unknown. 
An improved understanding of just how a particular neural network has stored the information which it has ‘learned’ is desirable, because then the system behaviour can be better understood and analysed [63]. Finally, we mention fuzzy logic systems. These have mostly been used in the past for small-scale control systems, but they have now developed to the stage where they can be used in large-scale expert systems (e.g. [53]). In order to keep the number of tasks in the ANSWERS project under control, it is currently intended not to investigate fuzzy logic systems.


Section 4 : Hardware dependence: implementing algorithms with RTDs, SETs, QCAs and other devices

4.1. RTDs

We have seen in section 2 that there is a fundamental constraint, due to signal propagation delays, on the speed at which a given problem can be solved using synchronous logic, and possibly asynchronous logic as well. On the other hand, we may interpret this in a different way, and ask what problems could be solved, given a digital logic device which is capable of running at a given high clock speed. RTDs are such high-speed devices, having been demonstrated to run at 66 GHz [70]. What useful algorithms could be run at such clock rates, given the limitation imposed by expression 2.19 (or, more accurately, by its equivalent formulation for RTDs)? A detailed analysis has yet to be carried out, but it is probable that by using pipelining, two 16-digit (i.e. 16 by 8-bit) numbers can be latched, added, and fed to an output register at effective clock rates up to 35 GHz (±50%). Such high-speed operations are desirable for correlators in digital signal processing, particularly in a military context, but also for high-speed database searching (cf. e.g. [9]). RTDs are already being applied for this purpose in the USA [83].

Another interesting interpretation can be drawn from the clock speed limit discussed in section 2. An area where it would be useful to operate at high speed on pairs of numbers is in the factorisation of large numbers in public key cryptography (cf. e.g. [29], [56]). An analysis of the maximum speed at which two numbers can be divided has not yet been carried out, but it seems that it should be possible to carry out pipelined multiplication of two 130-digit numbers (i.e. $L_{word} = 300$ bits) at rates approaching 20-30 GHz, using RTDs. The number of devices in an $L$-bit multiplier will be approximately $8L^2$, or 720,000. Suppose that the effective area of each device is $A = D^2 = (\alpha f)^2$, where $f$ is the minimum feature size. Suppose further that a factor of 2 is needed for supporting circuitry, that $\alpha = 6$, and that $f = 100$ nm. A 15 mm by 15 mm chip would then hold approximately 450 multipliers and could carry out ~$10^{13}$ multiplies per second. The number field sieve algorithm [48] would need of the order of $\exp(2L^{1/3}(\ln L)^{2/3}) \sim 3.5 \times 10^{18}$ operations, or about 100 hours, to obtain an answer. One hundred such chips operating in parallel would take about one hour. This is approximately 7 times faster than the estimated performance of a quantum computer carrying out the same task [61].
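The arithmetic behind this estimate can be checked with a short Python sketch. The constants are those quoted in the text; the 25 GHz clock is an assumption (the midpoint of the quoted 20-30 GHz range), the function names are hypothetical, and each multiplier is assumed to deliver one result per clock cycle once the pipeline is full.

```python
import math

L_BITS = 300                        # word length L used in the text
DEVICES_PER_MULT = 8 * L_BITS ** 2  # about 8 L^2 devices per multiplier
ALPHA = 6                           # device side D = ALPHA * f
FEATURE_M = 100e-9                  # minimum feature size f = 100 nm
OVERHEAD = 2                        # factor of 2 for supporting circuitry
CHIP_SIDE_M = 15e-3                 # 15 mm by 15 mm chip
CLOCK_HZ = 25e9                     # assumed midpoint of 20-30 GHz


def multipliers_per_chip():
    """Whole multipliers that fit on one chip."""
    device_area = (ALPHA * FEATURE_M) ** 2
    multiplier_area = OVERHEAD * DEVICES_PER_MULT * device_area
    return int(CHIP_SIDE_M ** 2 / multiplier_area)


def nfs_operations(l_bits=L_BITS):
    """Operation count exp(2 L^(1/3) (ln L)^(2/3)) as used in the text."""
    return math.exp(2 * l_bits ** (1 / 3) * math.log(l_bits) ** (2 / 3))


def hours_to_factor(chips=1):
    """Run time if every multiplier completes one multiply per clock."""
    rate = chips * multipliers_per_chip() * CLOCK_HZ
    return nfs_operations() / rate / 3600


if __name__ == "__main__":
    print(DEVICES_PER_MULT)            # 720000
    print(multipliers_per_chip())      # a little under the 450 quoted
    print(f"{nfs_operations():.2e}")   # of the order of 10^18
    print(f"{hours_to_factor():.0f} h on one chip")
    print(f"{hours_to_factor(100):.1f} h on one hundred chips")
```

Under these assumptions the sketch gives a little over 430 multipliers per chip and a run time of the order of 100 hours on a single chip, consistent with the rounded figures in the text.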

4.2. SETs

Although there have been recent descriptions of self-assembled SETs and related devices ([15], see also [36]), it will be assumed for the present that assemblies of SET devices will be built using conventional semiconductor fabrication technology. Because of the close link between the size of an SET and its maximum operating temperature, the implication is that unless the critical features of SETs can be made extremely small, SET systems may have to run at low temperatures. Other problems with SETs are that they are sensitive to buried charge, that they have low fan-out, and that they have relatively low clock rates (cf. e.g. [25]). The problems of buried charge sensitivity are being addressed by TU Delft as part of the ANSWERS project; it is UCL’s intention to examine the tradeoffs between speed, packing density and algorithm performance in the next six months.

To do this it will be necessary to choose one or more algorithms and architectures as test vehicles for analysis. At present it seems to be almost certain that SETs will have relatively high error rates, and it is therefore more appropriate to use neural network algorithms as a basis for investigation. It is provisionally intended to use multilayer perceptrons (MLPs) and self-organising maps (SOMs) as the target architectures. Figures 6 and 7 illustrate the principles of the SOM algorithm. This algorithm is easy to implement on a conventional serial computer, and special-purpose digital hardware has also been developed for it. However, it would be a challenging task to provide a probabilistic/analogue implementation using nanoelectronic devices.

Fig. 6 Illustration of the contents of a self-organising map, which is being trained to distinguish between handwritten zeros. Each of 49 cells contains a 1024-byte ‘memory’ vector. At the start of the training period the vector contents contain only random noise. As training proceeds the contents of the cells adjust to give a representation of the relative probability of different patterns in the training set. Left-hand image: early in training. Right-hand image: near end of training.

Fig. 7 One of many training patterns is correlated with the data stored in each cell of the SOM. The cell with the best match (the ‘winner’, here the cell with a partial ‘A’ as its contents) has its contents adjusted to be nearer the current pattern. In addition, cells in the neighbourhood of the winner are also updated. The process is repeated with many different training patterns. The neighbourhood size is initially very large, but is slowly reduced over the training period to a 3-by-3 cell window. The algorithm is easy to write for a serial machine, and would be relatively easy to implement on a parallel array, with a digital processor and local memory for each cell, were it not for the variable neighbourhood size. Implementing this algorithm in an analogue/probabilistic form with nanoelectronic devices will be very challenging. Animals and humans implement this algorithm, or something quite similar, almost every day.
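The training loop described in the caption to Fig. 7 can be sketched for a serial machine as follows. This is a minimal illustration, not the implementation used for the figures: the vector length, learning rate, linear shrinking schedule and all function names are assumptions, and squared Euclidean distance stands in for the correlation measure mentioned in the caption.

```python
import random

GRID = 7    # 7 x 7 = 49 cells, matching Fig. 6
DIM = 16    # assumed vector length for the sketch (Fig. 6 uses 1024 bytes)


def make_map(rng):
    """Fill every cell with random noise, as at the start of training."""
    return [[[rng.random() for _ in range(DIM)] for _ in range(GRID)]
            for _ in range(GRID)]


def distance(a, b):
    """Squared Euclidean distance; smaller means a better match."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def find_winner(som, pattern):
    """Return (row, col) of the best-matching cell."""
    cells = [(r, c) for r in range(GRID) for c in range(GRID)]
    return min(cells, key=lambda rc: distance(som[rc[0]][rc[1]], pattern))


def radius_for(epoch, epochs):
    """Shrink the neighbourhood radius linearly from GRID - 1 down to 1."""
    fraction_left = 1 - epoch / max(1, epochs - 1)
    return max(1, round((GRID - 1) * fraction_left))


def train_step(som, pattern, radius, rate):
    """Pull the winner and its square neighbourhood towards the pattern."""
    wr, wc = find_winner(som, pattern)
    for r in range(max(0, wr - radius), min(GRID, wr + radius + 1)):
        for c in range(max(0, wc - radius), min(GRID, wc + radius + 1)):
            cell = som[r][c]
            for i in range(DIM):
                cell[i] += rate * (pattern[i] - cell[i])
    return wr, wc


def train(som, patterns, epochs, rng, rate=0.2):
    """Present every pattern once per epoch, in random order."""
    for epoch in range(epochs):
        radius = radius_for(epoch, epochs)   # radius 1 is a 3-by-3 window
        for pattern in rng.sample(patterns, len(patterns)):
            train_step(som, pattern, radius, rate)


if __name__ == "__main__":
    rng = random.Random(0)
    som = make_map(rng)
    patterns = [[0.0] * DIM, [1.0] * DIM]    # two toy training patterns
    train(som, patterns, epochs=20, rng=rng)
    for pattern in patterns:
        print(find_winner(som, pattern))
```

The inner loops over the neighbourhood window are the part that changes size during training, which is the feature the caption identifies as awkward for a fixed parallel array.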


4.3. QCAs

Work on possible QCA architectures has already been carried out at UCL ([74],[75]) using a spreadsheet simulator developed there, based on results from the AQUINAS simulator developed by Lent and Tougaw at the University of Notre Dame. The latter simulator was found to have some limitations, and it is understood to be in the process of a partial re-write. The main simulator being developed at Pisa will use a slightly different approach, but it is expected that the UCL spreadsheet simulator can be modified to handle the different assumptions made in the Pisa model. The UCL simulator is based on the assumption that it may eventually be possible to fabricate assemblies of QCAs to implement ‘conventional’ digital logic. Of course, this is an optimistic assumption, but this approach has already thrown up interesting questions related to clocking speeds which would perhaps otherwise not have been noticed for some time. It is therefore intended to continue the emphasis on multiphase digital clocked logic for the time being. Figure 8 gives an example of how QCA elements can in principle be combined to make the basic components of digital logic circuitry. Larger circuit elements are being designed ([74],[75],[76]; see also recent work by P. Kogge [65]).

[Fig. 8 panel labels: Binary Wire, Corner, Majority Gate (M), Fan out, Cross Over, AND gate, Inverter, Corner NOT, NAND gate]

Fig. 8 This figure shows how QCA cells could be combined to produce the basic logic elements needed to implement digital logic circuits of arbitrary complexity ([74],[75],[76]). ‘SQUARES’ stands for Standard Quantum cellular automata Array Elements.
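At the purely logical level, the gates named in Fig. 8 can be modelled in a few lines of Python. The sketch below assumes the standard majority-logic construction, in which an AND or OR gate is obtained by holding one input of a three-input majority gate at a fixed value; it says nothing about the physical behaviour of the cells, and the function names are hypothetical.

```python
from itertools import product


def majority(a, b, c):
    """Three-input majority vote, the logic function of the majority gate."""
    return int(a + b + c >= 2)


def inverter(a):
    """Logical NOT."""
    return 1 - a


def and_gate(a, b):
    """Majority gate with the third input held at logic 0."""
    return majority(a, b, 0)


def or_gate(a, b):
    """Majority gate with the third input held at logic 1."""
    return majority(a, b, 1)


def nand_gate(a, b):
    """AND followed by an inverter."""
    return inverter(and_gate(a, b))


if __name__ == "__main__":
    print("a b AND OR NAND")
    for a, b in product((0, 1), repeat=2):
        print(a, b, and_gate(a, b), or_gate(a, b), nand_gate(a, b))
```

Since NAND alone is sufficient to build any Boolean function, this is one way to see why a majority gate plus an inverter supports logic circuits of arbitrary complexity.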

It should also be noted that a proposal has recently been made for a QCA-like architecture which is intended to implement quantum computing ([6]; see e.g. [61] for a recent review of quantum computing). The QCA structures which have so far been considered use quantum processes for their operation, but are intended to implement conventional digital logic processes. It is not clear at present whether the algorithmic and architectural implications of reference [6] can be investigated in the ANSWERS project, but it appears probable that, whatever the possible implications for performance which the use of quantum computing circuitry may offer over classical computing paradigms, the structures proposed in [6] will face exactly the same problems as QCAs in terms of fabrication, noise tolerance, multiphase clocking and circuit analysis requirements. Another proposed design for a quantum computer [34] is even more demanding in its technical requirements.


Section 5 : Summary and Conclusions

This document has outlined some of the factors which are likely to affect the performance of future nanoscale computers, with particular emphasis on the effect of choosing particular architectures and algorithms. In view of the existing dominance of conventional digital logic, and the likelihood of such logic continuing to be needed for an indefinite period, much of the document has been concerned with defining the constraints on digital logic, independent of the devices used to implement that logic. Analogue/probabilistic systems, of which some neural network types are the main examples, have been considered to a lesser extent. However, given that it is highly probable that nanoscale systems will have high error rates unless extensive fault tolerance is provided, it is proposed to use SETs as neural network testbeds for exploring fault tolerant architectures.

The topics for further investigation will include:

• Refinement of signal propagation model outlined in section 2.1;
• Estimation of component density and speed of digital logic circuits and memory using RTDs instead of CMOS;
• Circuit designs for probabilistic/analogue circuits using SETs and RTDs;
• Effects of fault-tolerant circuit design on performance of digital and probabilistic/analogue circuits using arbitrary devices;
• Completion of QCA circuit element design;
• Preliminary work on possible probabilistic use of QCAs.

The main conclusions of this interim document are that signal propagation delays and device errors will be extremely important in determining the performance of future systems, whatever the nanoelectronic device, whatever the algorithm, and whatever the architecture. It is UCL’s intention to investigate these constraints within the context of the ANSWERS collaboration.

References

[1]

C. Akrout et al., “A 480-MHz RISC microprocessor in a 0.12 Leff CMOS technology with copper interconnects”, IEEE Journ. Solid State Circ. 33, 1609-1615, 1998

[2]

M.G. Ancona, “Bit errors in single-electron digital circuits”, Proc. Int. Workshop on Quantum Functional Devices, NIST Gaithersburg VA, Nov 1997.

[3]

M. Akazawa, “Binary-decision-diagram logic systems using single-electron circuits and single-flux-quantum circuits” preprint 1998 ([email protected])

[4]

S.F. Al-Sarawi, D. Abbott & P.D. Franzon, “A review of 3-D packaging technology”, IEEE Trans. Compon. Pack. & Manuf. Technol. Part B, 21, 2-14, 1998

[5]

D. W. Bailey & B.J. Benschneider, “Clocking design and analysis for a 600-MHz Alpha microprocessor”, IEEE Journ. Solid State Circ. 33, 1627-1633, 1998

[6]

S.C.Benjamin & N.F. Johnson, “Cellular structures for computation in the quantum regime” , Oxford Centre for Quantum Computing reprint 1998 (www.qbit.org/ research/Nano)

[7]

W.P. Burleson, M. Ciesielski, F. Klass & W. Lu, “Wave-pipelining: a tutorial and research survey”, IEEE Trans. VLSI Syst. 6, 464-473, 1998

[8]

F. Chen & D. Gardner, “Influence of line dimensions on the resistance of Cu interconnections”, IEEE Electron Device Lett. 19, 508-510, 1998

[9]

T. Chen (ed) “Highlights of statistical signal and array processing”, IEEE Signal Proc. Mag. September 1998, pp 21-63.

[10]

J. C. da Costa, J. Hoekstra, M. Goossens, C.J.M. Verhoeven & A.H.M. Van Roermund, “Considerations about nanoelectronic GSI processors”, to appear in Analog integrated circuits and signal processing: special issue: Circuit design in nano and quantum device technology, 1999-01-26

[11]

D. Crawley, “An analysis of MIMD processor node designs for nanoelectronic systems”, UCL IPG Report 97/3, available via http://ipga.phys.ucl.ac.uk/reports/html

[12]

D. Crawley, “An analysis of MIMD processor interconnection networks for nanoelectronic systems”, UCL IPG Report 98/3, available via http://ipga.phys.ucl.ac.uk/reports/html

[13]

T. Dracopoulos, Evolutionary learning Algorithms for Neural Adaptive Control, Springer-Verlag 1997

[14]

M.J.B. Duff & T.J. Fountain (eds), Cellular Logic Image Processing, Academic Press, 1986.

[15]

L. Feldheim & C.D. Keating, “Self-assembly of single electron transistors and related devices”, Chem. Soc. Rev. 27, 1-12, 1998

[16]

D. Floreano & F. Mondada, “Evolutionary neurocontrollers for autonomous mobile robots”, Neural Networks 11, 1461-1478, 1998.

[17]

D.B. Fogel (ed), ‘Evolutionary Computation: The Fossil Record’, IEEE Press, 1998 (ISBN 0-7803-3481-7)

[18]

T.J. Fountain, Parallel Computing: Principles and Practice, Cambridge University Press, 1994.

[19]

T.J. Fountain & L. Passmore, “3D image processing architectures”, UCL IPG interim report 1998

[20]

FPGA-based Computing Machines website 1998: http://www.io.com/~guccione/HW_list.html

[21]

A Fujiwara et al., ‘Silicon double-island single-electron device’, pp 163-167, Technical Digest, Int. Electron Devices Meeting (IEDM97), IEEE Inc., 445 Hoes Lane, Piscataway NJ 08855, USA, 1997.

[22]

K.-I. Goto et al., ‘A high-performance 50 nm PMOSFET….’, pp 471-475, Technical Digest, Int. Electron Devices Meeting (IEDM97), IEEE Inc., 445 Hoes Lane, Piscataway NJ 08855, USA, 1997.

[23]

P.E. Gronowski, W.J. Bowhill, R.P. Preston, M.K. Gowan & R.L. Allmon, “High-performance microprocessor design”, IEEE Journ. Solid-State Circ. 33, 676-686, 1998

[24]

R.Gupta, J. Willis & L. Pileggi, “Analytic termination metrics for pin-to-pin lossy transmission lines with nonlinear drivers”, IEEE Trans. VLSI Syst. 6, 457-463, 1998

[25]

P. Hadley & J.E. Mooij, “Quantum nanocircuits: chips of the future?”, TU Delft reprint, 1999 (http://vortex.tn.delft.nl/~hadley/publications/qdevices99/pearsal.html).

[26]

J. Hastad, Computational Limitations of Small-Depth Circuits, MIT Press, 1982

[27]

S. Haykin, Neural Networks, Macmillan, NY, 1994

[28]

J.R. Heath, P.J. Kuekes, G.S. Snider & R.S.Williams, “A defect-tolerant computer architecture: opportunities for nanotechnology”,Science 280, 1716-1721, 1998

[29]

M.E. Hellman, “The mathematics of public key cryptography”, Sci. Am. 241,130-139, 1979

[30]

J.L. Hennessy & D.A. Patterson, Computer architecture: a quantitative approach, Morgan Kaufmann, San Mateo, Cal, USA, 1990

[31]

W.D. Hillis, The Connection Machine, MIT Press, Cambridge, Mass., 1985.

[32]

D.S. Hochbaum (ed), Approximation algorithms for NP-hard problems, PWS Publishing, Boston, MA, 1997

[33]

V.K. Jain & S. Horiguchi, “VLSI considerations for TESH: a new hierarchical interconnection network for 3D integration”, IEEE Tran. VLSI Syst. 6, 346-353, 1998.

[34]

B.E. Kane, “A silicon-based nuclear spin quantum computer”, Nature 393, 133-137, 1998.

[35]

P.Keskinocak, “On-line algorithms: how much is it worth to know the future?”, IBM Research Report RC 21340 (TJ Watson Res. Center), Nov. 1998.

[36]

C.J. Kiely, J. Fink, M. Brust, D. Bethell & D.J. Schiffrin, “Spontaneous ordering of bimodal assemblies of nanoscopic gold clusters”, Nature 396, 444-446, 1998

[37]

T. Kobayashi et al., ‘A 0.24 µm2 cell process with 0.18 µm width …..’, pp 275-279, Technical Digest, Int. Electron Devices Meeting (IEDM97), IEEE Inc., 445 Hoes Lane, Piscataway NJ 08855, USA, 1997.

[38]

A. Koga et al., ‘Two-dimensional borderless contact pad technology for a 0.135µm2 4-Gigabit DRAM cell’, pp 25-28, Technical Digest, Int. Electron Devices Meeting (IEDM97), IEEE Inc., 445 Hoes Lane, Piscataway NJ 08855, USA, 1997.

[39]

Kohonen, Self-organising maps, Springer-Verlag, Berlin, 1995.

[40]

I. Koren & Z. Koren, “Defect tolerance in VLSI circuits: techniques and yield analysis”, Proc. IEEE 86, 1819-1836, 1998

[41]

M. Korkin, H. de Garis, F. Gers & H. Hemmi, “CBM (CAM_BRAIN MACHINE, http://www.genobyte.com/GP97full.html

[42]

J. Lach, W.H. Mangione-Smith & M. Potkonjak, “Low overhead fault-tolerant FPGA systems”, IEEE Trans. VLSI Syst. 6, 212-221, 1998

[43]

K. Likharev, “Development of new 10-nm-scale active devices”, 1997, http://rsfq1.physics.sunysb.edu/~likharev

[44]

K. Likharev, “Single-electron parametron: reversible computation in a discrete-state system”, Science 273, 763-765, 1996

[45]

D. Mange, E. Sanchez, A. Stauffer, G. Tempesti, P. Marchal & C. Piguet, “Embryonics: a new methodology for designing field-programmable gate arrays with self-repair and self-replicating properties”, IEEE Trans. VLSI Syst. 6, 387-399, 1998

[46]

D. Matzke, “Will physical scalability sabotage performance gains?”, IEEE Computer 30, 4, 37-39, 1997.

[47]

C. Mead & L. Conway, Introduction to VLSI systems, Addison-Wesley 1980

[48]

A.J. Menezes, P.C. van Oorschot & S.A. Vanstone (eds), Handbook of Applied Cryptography, CRC Press, Boca Raton, Fl., 1997

[49]

MELARI Nanowire Project report 1997, work package 1: electron transport in nanowires (http://www.cismi.dk/main/RunProj/NANO97_98/task1_a.htm)

[50]

Nakamura et al., ‘A simple 4 G-bit DRAM technology ….’, pp 29-32, Technical Digest, Int. Electron Devices Meeting (IEDM97), IEEE Inc., 445 Hoes Lane, Piscataway NJ 08855, USA, 1997.

[51]

H. Ohnishi, Y. Kondo & K. Takayanagi, “Quantized conductance through individual rows of suspended gold atoms”, Nature 395, 780-784, 1998

[52]

C. Pacha, U. Auer, P. Glösekötter, A. Brennemann, W. Prost, F.-J. Tegude & K. Goser, “Resonant tunneling transistors for threshold logic circuit applications”, to appear in GLSVLSI99 (Great Lakes Symposium on VLSI, March 1999)

[53]

K. Pan, G.N. DeSouza & A.C. Kak, “FuzzyShell: A Large-Scale Expert System Shell Using Fuzzy Logic for Uncertainty Reasoning”, IEEE Trans. Fuzzy Syst. 6, 563-581, 1998.

[54]

N. Weste & K. Eshraghian, Principles of CMOS VLSI design, Addison-Wesley, 1985

[55]

B.D. Ripley, Pattern recognition and neural networks, Cambridge University Press, 1996.

[56]

R.L. Rivest, A. Shamir & L. Adleman, “On digital signatures and public-key cryptosystems”, MIT CS Lab. Rept. MIT/LCS/TR-212, 1979

[57]

R. Salomon, “Evolutionary Algorithms and Gradient Search: Similarities and Differences”, IEEE Tran. Evol. Comput. 2, 45-55, 1998.

[58]

J-I. Shirakashi et al., ‘Room temperature Nb/Nb oxide-based single-electron transistors’, pp 175-178, Technical Digest, Int. Electron Devices Meeting (IEDM97), IEEE Inc., 445 Hoes Lane, Piscataway NJ 08855, USA, 1997

[59]

J. Silbermann et al., “A 1.0-GHz single-issue 64-bit PowerPC integer processor”, IEEE Journ. Sol. State Circ. 33, 1600-1607, 1998

[60]

Spagocci, T.J. Fountain & D. Crawley, “A model for chip error rate”, Internal report, Image Proc. Group, UCL, 1998 (available online from http://ipga.phys.ucl.ac.uk/reports/html)

[61]

A. Steane, “Quantum computing”, Rep. Prog. Phys. 61, 117-173, 1998

[62]

N. Takahashi, N. Senba, Y. Shimada, I. Morisaki & K. Tokuno, “Three-dimensional memory module”, IEEE Trans. Compon.Pack.& Manuf. Technol. Part B, 21, 15-19, 1998

[63]

A.B. Tickle, R. Andrews, M. Golea & J. Diederich, “The Truth Will Come to Light: Directions and Challenges in Extracting the Knowledge Embedded Within Trained Artificial Neural Networks”, IEEE Trans. Neur. Netw. 9, 1057-1068, 1998.

[64]

T.N. Todorov, “Calculation of the residual resistivity of three-dimensional quantum wires”, Phys. Rev. B54, 5801-5813, 1996.

[65]

ULTRA Electronics Review Meeting Proceedings, Estes Park , CO, 18-21 Oct 1998

[66]

J.Villasenor & B. Hutchings, “The flexibility of configurable computing”, IEEE Signal Proc. Mag. September 1998, pp 67-84.

[67]

E. Waingold et al., “Baring it all to software: the raw machine”, MIT Lab for Computer Sci. Report TR-709, Mar. 1997

[68]

H. Wakabayashi et al., ‘A high-performance 0.1µm CMOS with….’, pp 99-102, Technical Digest, Int. Electron Devices Meeting (IEDM97), IEEE Inc., 445 Hoes Lane, Piscataway NJ 08855, USA, 1997.

[69]

M. Yasunaga, I. Hachiya, K. Moki & J.H. Kim, “Fault-tolerant self-organising map implemented by waferscale integration”, IEEE Trans. VLSI Syst. 6, 257-265, 1998

[70]

T.P.E. Broekaert, B. Brar, F. Morris, A.C. Seabaugh & G. Frazier, “Resonant tunneling technology for mixed signal and digital circuits in the multi-GHz domain”, invited paper, 9th Great Lakes Symposium on VLSI, March 4-6, Ann Arbor Michigan.

[71]

D.V. Averin & K.K. Likharev, “Possible applications of the single-charge tunnelling”, pp311-332, Single Charge Tunneling, H. Grabert & M.H. Devoret (eds), Plenum Press, 1992.

[72]

P.M.B. Vitanyi, “Locality, communication, and interconnect length in multicomputers”, MIT Lab. For Computer Science reprint August 1987.

[73]

T. Toffoli & N. Margolus, Cellular automaton machines, MIT Press, Cambridge, MA, 1985.

[74]

D. Berzon & T.J. Fountain, “Computer memory structures using QCAs”, UCL IPG Report 98/1, 1998 (available via http://ipga.phys.ucl.ac.uk/reports)

[75]

D. Berzon & T.J. Fountain, “A systematic approach to QCA circuit designs”, talk given at DARPA Ultra Electronics Review, Estes Park, CO, Oct 1998

[76]

D. Berzon & T.J. Fountain, “A memory design in QCAs using the SQUARES formalism”, to appear in GLSVLSI99 (Great Lakes Symposium on VLSI, March 1999)

[77]

A.B. Kahng & S. Muddu, “Delay analysis of VLSI interconnections using the diffusion equation model”, UCLA CS Dept. Rept. 1995 ([email protected], [email protected])

[78]

E.G. Friedman, High performance clock networks, Kluwer Academic, Boston 1997

[79]

J.C. Whitaker, The Electronics Handbook, CRC/IEEE Press, 1996

[80]

N.T. Jarwala & D.K. Pradhan, “TRAM: a design methodology for high-performance, easily testable, multibit RAMs”, IEEE Trans. Comput. 37, 1235-1250, 1988

[81]

C. Mead & M. Rem, ‘Minimum propagation delays in VLSI’, IEEE J. Solid-State Circ. SC-17, 773-775, 1981.

[82]

J.F. Ziegler, M.E. Nelson, J.D. Shell, J. Peterson, C.J. Gelderloos, H.P. Muhlfeld & C.J. Montrose, “Cosmic ray soft error rates of 16-Mb DRAM memory chips”, IEEE J. Solid. State. Circ. 33, 246-251, 1998

[83]

P. Mazumder, S. Kulkarni, M. Bhattacharya, J.P. Sun & G.I. Haddad, “Digital circuit applications of resonant tunneling devices”, Proc. IEEE 86, 665-686, 1998

[84]

S. Hauck, “The role of FPGAs in reprogrammable systems”, Proc. IEEE 86, 615-638, 1998

[85]

D.K. Schroder, “New life in detecting defects”, Circuits & Devices, pp 14-20, Nov. 1998

[86]

B. Ciciani, Manufacturing yield evaluation of VLSI/WSI systems, IEEE Comp. Soc. Press, CA, 1994

[87]

G.J. Jeong & M.K. Lee, “Design of a scalable pipelined RAM system”, IEEE J. Solid State Circ. 33, 910-914, 1998

[88]

W. Poundstone, The recursive universe , Oxford University Press, 1985.

[89]

Y. Taur & E.J. Nowak, “CMOS devices below 0.1 µm: how high will performance go?”, 215-218, IEDM 97 Technical Digest.
