A 3x9 Gb/s Shared, All-Digital CDR for High-Speed, High-Density I/O
This paper appears in: IEEE Journal of Solid-State Circuits
Date of Publication: March 2012
Author(s): Loh, M.; Emami-Neyestanak, A. (California Institute of Technology, Pasadena, CA, USA)
Volume: 47, Issue: 3
On Page(s): 641 - 651
ABSTRACT
This paper presents a novel all-digital CDR scheme in 90 nm CMOS. Two independently adjustable clock phases are generated from a delay line calibrated to 2 UI. One clock phase is placed in the middle of the eye to recover the data (“data clock”) and the other is swept across the delay line (“search clock”). As the search clock is swept, its samples are compared against the data samples to generate eye information. This information is used to determine the best phase for data recovery. After placing the search clock at this phase, search and data functions are traded between clocks and eye monitoring repeats. By trading functions, infinite delay range is realized using only a calibrated delay line, instead of a PLL or DLL. Since each clock generates its own alignment information, mismatches in clock distribution can be tolerated. The scheme's generalized sampling and retiming architecture is used in an efficient sharing technique that reduces the number of clocks required, saving power and area in high-density interconnect. The shared CDR is implemented using static CMOS logic in a 90 nm bulk process, occupying 0.15 mm². It operates from 6 to 9 Gb/s, and consumes 2.5 mW/Gb/s of power at 6 Gb/s and 3.8 mW/Gb/s at 9 Gb/s.
Comments:
Very little to dislike in this paper. It might turn out to be as interesting as Staszewski's all-digital PLL work!
The key advantage of an all-digital system is that while it may not be the most "beautiful" implementation around (think wine glass versus earthenware pot), once you've got it right, you can port it to any process with minimal work and be happy sitting on a beach with a piña colada (or, perhaps, be rendered redundant!).
The CDR described in this paper is mainly for source-synchronous links: stuff like shipping data from one core on a chip to another. The frequency difference between the sending clock and the receiving clock is not expected to be large enough to require a very high-bandwidth CDR. It's for slow stuff... you have multiple buses from one core to the other, and while they are driven by the same PLL, the routing might have decent mismatches, and you need a CDR to correct for that so that the back-end digital does not have a fit closing timing (silly sounding, but difficult to do in multi-core chips and processors like the "Cell" processor school kids are going ga-ga about...)
The idea is to take the source clock and run it through a delay chain which does not need to be accurate but generates enough phases at the resolution you require. Now take two clocks. Keep one at the centre of the eye opening (the "data clock") and make the other move about trying to find the optimum sampling point (the "search clock"). Once it finds that, make it the data clock and let the first one move about (a toy model of this ping-pong is sketched after the list below). This does two things:
- It calibrates out the mismatch between the two clocks
- If the eye begins to drift away, you can find the optimum sampling point and then INVERT THE CLOCK (this is a half-rate system) to move back (or ahead) by one UI
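Here is a toy behavioural model of that ping-pong search, just to make the mechanics concrete. Everything in it is assumed for illustration: the 64-tap count, the 2 UI span, and the crude "coin flip near a transition" error model have nothing to do with the paper's actual circuit.

```python
import random

TAPS = 64                 # assumed delay-line taps across 2 UI
UI = TAPS // 2            # taps per unit interval
EYE_HALF = 5              # taps around a transition that sample badly

def sample(bits, tap):
    """Toy sampler: taps in the second UI land on the next bit; taps
    near a bit boundary return a coin flip."""
    dist = min(tap % UI, UI - tap % UI)
    if dist <= EYE_HALF:
        return random.randint(0, 1)
    return bits[tap // UI]

def sweep(data_tap, nbits=2000):
    """Search-clock sweep: count, per tap, how often its samples
    disagree with the data clock's samples over random data."""
    errs = [0] * TAPS
    for _ in range(nbits):
        bits = [random.randint(0, 1) for _ in range(2)]
        ref = sample(bits, data_tap)
        for tap in range(TAPS):
            if sample(bits, tap) != ref:
                errs[tap] += 1
    return errs

def eye_centre(errs):
    """Middle of the longest zero-error run: the new sampling phase."""
    best_start, best_len, start = 0, 0, None
    for tap, e in enumerate(errs + [1]):    # sentinel closes the last run
        if e == 0:
            start = tap if start is None else start
        elif start is not None:
            if tap - start > best_len:
                best_start, best_len = start, tap - start
            start = None
    return best_start + best_len // 2

# Ping-pong: the search clock finds the eye centre, becomes the data
# clock, and the roles swap, so each clock path measures its own skew.
data_tap = UI // 2
for step in range(3):
    data_tap = eye_centre(sweep(data_tap))
    print(f"step {step}: data clock now at tap {data_tap}")
```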
Interesting....
The rest is all easily figured out. Take the multiple samples, filter them... figure out the point where the search data differs from the sampled data. Do some fancy probability stuff to figure out how many samples you need to take for the BER you need (a back-of-envelope version follows the list below).
- The "search" data and the "sampled" data are compared in a low-rate manner
- The clock phase updates are made by a control block in a low rate manner
- Though the exact delay of the delay chain is unimportant, we do need to have some control on the delay to control the band-width. Too high and it becomes unstable. Too low and it tracks nothing and BER shoots up.
- While the routing of CLK and CLKZ are done digitally, they need to be carefully done. There is a comment in the paper about this.
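And the back-of-envelope version of the probability stuff: if a candidate phase shows zero mismatches over N compared samples, the probability of seeing that while the true error rate is actually p or worse is at most (1 - p)^N, so solving (1 - p)^N ≤ 1 - C gives the sample count for confidence C. The BER targets and the 99% confidence figure below are my illustrative picks, not numbers from the paper.

```python
import math

def samples_needed(ber, confidence):
    """Error-free samples needed to claim BER < ber at the given confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - ber))

for ber in (1e-3, 1e-6, 1e-9):
    n = samples_needed(ber, 0.99)
    print(f"BER < {ber:g} at 99% confidence: {n:,} error-free samples")
```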
Issues:
1. The delay chain will be very susceptible to supply noise. It is not described how that is combated. Combating it would probably not hurt power too much (unless you want to put in an LDO to generate the supply) but would definitely cost area. It is unclear from the paper whether sufficient decap was put in, though the micrograph does show unnamed areas which could be decoupling...
2. There are no details about jitter budgets: how much was given to the loop, how much to timing error, etc.
Good paper overall. While the authors have been cautious in pitching it for source-synchronous interconnects, it might also work for serial-link applications with an embedded clock. The bandwidth of the CDR is very low since the same control block adjusts the clocks for many channels. If we speed up the control block and dedicate one to each channel, things would get interesting, and a comparison with a DLL-based CDR for cable applications would be worth looking at.
Also, it is interesting that an eye-opening search scheme has been used instead of the traditional bang-bang phase detector (BB-PD) loop. A BB-PD loop, coupled with the delay chain shown here and a digital filter, would be very close to an all-digital CDR for plesiochronous links. The present scheme is good for multiple links in a source-synchronous clocking scheme.
I have seen a 5 Gb/s VCO-based CDR which took ~20-30 mA. At this paper's 2.5 mW/Gb/s, the same application would take something like 12.5 mW, i.e. roughly 12.5 mA from a 1 V supply, without requiring too much analog work, and would give something easily portable between processes. The higher bandwidth required would likely push the power up, but even if it ends up at 20-30 mA it is still competitive.
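The arithmetic behind that estimate, with the 1.0 V supply being my assumption rather than anything stated in the paper:

```python
rate_gbps = 5.0                      # data rate of the VCO-based CDR I recall
mw_per_gbps = 2.5                    # this paper's efficiency figure at 6 Gb/s
vdd = 1.0                            # assumed supply for the 90 nm process
power_mw = rate_gbps * mw_per_gbps   # 12.5 mW
print(f"{power_mw} mW -> {power_mw / vdd:.1f} mA at {vdd} V")
```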
Don't remember the comparable area number so cannot compare...
Really...a BB-PD with a delay loop would be an interesting all-digital CDR to compare with the above numbers.