The concept of an ACEX® cost-effective first level surface detector trigger in the Pierre Auger Observatory

Z. Szadkowski*1

Physics Department, Bergische Universität Wuppertal, 42097 Wuppertal, Germany

Received 5 April 2005; received in revised form 15 June 2005; accepted 15 June 2005
Available online 12 July 2005

Abstract

The paper describes the new design of the first level trigger for the surface array in the Pierre Auger Observatory. The previous design was tested in a small test segment called the Engineering Array (EA). It confirmed the full functionality and reliability of the PLD approach. However, because of the high price of the chips available at that time, a new cost-effective design was developed. Altera® offered a cost-effective chip family, which allows the total budget of the electronics to be reduced without compromising functionality. The concept described here comprises the splitting of data processing into two sub-channels implemented in chips working in parallel, the synchronization of the chips, and the automatization of internal processing management. Together with the fully pipelined AHDL code, it became the framework for further implementation in the environmental conditions of the Argentinian pampa.

© 2005 Elsevier B.V. All rights reserved.

PACS: 29.40; 85.40; 95.30.-k

Keywords: Pierre Auger Observatory; Triggers; PLD

1. Introduction

The Pierre Auger Observatory is the world’s largest surface array detector, investigating the highest-energy cosmic rays. It will consist of two sites: the Southern site (in Argentina, under construction since 2000) and the Northern site (USA, foreseen to start in 2006). Each site will contain 1600 surface detector stations distributed over 3000 km². The surface detectors are water Cherenkov tanks, each equipped with three PMTs and low-power electronics, supplied by a solar panel, and transferring data via a custom wireless network [1]. These conditions impose special requirements on the Surface Detector (SD) trigger system. The system must operate with a limited solar power supply, the variation of the environmental temperature...

The preliminary phase of the Pierre Auger Observatory was the EA, containing 40 Cherenkov detectors, built to verify and improve several variants of the hardware and software designs [2–4].

In each tank three PMTs convert the Cherenkov light into electrical signals. The signals from the anodes (low-gain channel) and dynodes (high-gain channel) are read out by a Front End Board (FEB). Splitting the signals extends the measured energy range to 15 bits, with 5 bits overlapping. All analog signals pass through an anti-aliasing 5-pole Bessel filter with a cut-off at 20 MHz and are digitized at 40 MHz in AD9203 Flash ADCs. All digital signals are then processed by Altera® programmable logic device (PLD) chips, supported by an additional dual-port random access memory (DPRAM) as a temporary buffer [5]. The main task of the PLD chips is the hardware generation of first level triggers from the high-gain channel. Part of the data stream is temporarily stored in internal or external buffers in order to investigate the structure of the input data. The calibration is chosen to provide a relatively high event rate, too high to be transmitted via the communication link, which has a relatively narrow bandwidth. The rate is subsequently reduced by software in the second level trigger implemented in the Station Controller (part of the Unified Board (UB), see Fig. 1).

The PLD approach provides good long-term stability even with the strong daily temperature variation. Additionally, in contrast to an ASIC design, it allows the trigger algorithms and the code to be modified or optimized even in extreme environmental conditions.

2. APEX™—Engineering Array phase

All the detectors in the EA were equipped with EP20k200RI240-2 chips from the Altera® APEX™ 20K family. This would have been a good solution for a surface detector trigger (large capacity of resources, fast register performance) [6], but because of the high price, more cost-effective designs were developed. The Altera® ACEX® family made it possible to reduce the total price significantly without reducing any functionality. The experience gained with the EA was applied to the pre-production phase in order to improve the quality of the data and the reliability of the system.

3. ACEX®—chip selection

Data from the PMTs are continuously monitored in two channels: a "fast" channel, which records the shower profile, and a "slow" channel, used for self-calibration with the flux of through-going muons [7].

The biggest ACEX® chip contains only 48 kbit of internal RAM in Embedded Array Blocks (EAB). This allows the implementation of only $2 \times 32\,\text{bit} \times 768$ words: exactly enough for two fast buffers, but only with a 32-bit width. To implement a full-width (64-bit) bus, two chips are required. The fast channel therefore had to be split into two sub-channels, high-gain and low-gain, processing signals from the dynodes and anodes of the PMTs, respectively [8].
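The memory budget quoted above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the constant and function names are hypothetical, and 1 kbit is assumed to mean 1024 bits.

```python
# Illustrative check of the embedded-RAM budget described in the text.
# All names are hypothetical; 1 kbit is assumed to be 1024 bits.

EAB_RAM_BITS = 48 * 1024      # internal RAM of the biggest ACEX chip
WORDS_PER_BUFFER = 768        # words per fast buffer, as quoted in the text
N_BUFFERS = 2                 # two fast buffers


def buffer_bits(bus_width_bits: int) -> int:
    """Bits needed to hold both fast buffers at the given bus width."""
    return N_BUFFERS * bus_width_bits * WORDS_PER_BUFFER


def chips_needed(bus_width_bits: int) -> int:
    """Smallest number of chips whose combined RAM holds both buffers."""
    need = buffer_bits(bus_width_bits)
    return -(-need // EAB_RAM_BITS)  # ceiling division
```

Under these assumptions a 32-bit bus fills one chip exactly, while a 64-bit bus needs two chips, which matches the split into a master and a slave chip.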

The slow channel was not used in the EA phase due to some manufacturing defects. Full operation began in the pre-production phase. ACEX® chips contain sufficient logic resources to allow the implementation of the “slow channel algorithm”; however, the implementation of two slow buffers requires the support of external memory. To ensure sufficiently large buffers, and to simplify the design, a 16K × 36-bit Dual-Port RAM was chosen. The one-way data flow (writing to the left port while independently and simultaneously reading from the right port) allows the separation of processes, speeding up the system and reducing dead time.

The Dual-Port RAM effectively operates in a FIFO mode. It was preferred because dedicated FIFO chips were significantly more expensive for the same capacity.

The structure of the internal routines is fully pipelined. All processes are divided into sub-processes, performed in a single clock cycle. Some routines (mainly wide-bus adders) are relatively complex and for reliable, high-speed performance require two clock cycles. To synchronize data flow, adjacent single time-bin processes are delayed in the parallel simple DFF chain (CD and EF subroutines in Fig. 2).
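The alignment of a two-cycle routine with a parallel DFF chain can be modelled in software. The sketch below is not the AHDL implementation; `DelayLine` and `run_pipeline` are hypothetical names, and the "adder" is simply a sum whose result is assumed to appear two clock cycles after its inputs.

```python
from collections import deque


class DelayLine:
    """Chain of D flip-flops: the output is the input delayed by `depth` clocks."""

    def __init__(self, depth: int, reset_value=0):
        # depth must be at least 1; all registers start at reset_value
        self._regs = deque([reset_value] * depth, maxlen=depth)

    def clock(self, value):
        out = self._regs[0]        # oldest value leaves the chain
        self._regs.append(value)   # new value enters, the oldest is dropped
        return out


def run_pipeline(samples, adder_latency=2):
    """Pair each delayed sum a + b with a flag sent through a parallel DFF chain.

    `samples` is an iterable of (a, b, flag) tuples, one per clock cycle.
    """
    sum_pipe = DelayLine(adder_latency)    # models the two-cycle adder
    flag_pipe = DelayLine(adder_latency)   # parallel chain of equal depth
    out = []
    for a, b, flag in samples:
        out.append((sum_pipe.clock(a + b), flag_pipe.clock(flag)))
    return out
```

Because both chains have the same depth, each flag leaves the pipeline in the same clock cycle as the sum it belongs to. If the flag chain were shorter, the flag would be paired with an older sum.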

4. The fast channel

The fast channel records the profile of the waveforms generated by cosmic ray showers. An event comprises 256 time bins before the trigger and 512 time bins after it. Data coming from the FADCs circulate in a 256-word buffer until a trigger occurs. They are then frozen, and the subsequent data are registered in the following 512-word buffer. When the first event is complete, the system switches the data stream to the identical adjacent buffer, while the data in the previous one wait for the DMA transfer, which is requested from the micro-controller via an interrupt.
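The buffering scheme described above can be sketched as a small software model. This is not the AHDL code: the class and attribute names are hypothetical, and it is assumed here that the bin carrying the trigger is the last one written to the circulating part.

```python
PRE_BINS = 256    # circulating part (time bins before the trigger)
POST_BINS = 512   # fixed part (time bins after the trigger)


class FastBuffer:
    """Software model of one fast buffer: a ring part plus a fixed part."""

    def __init__(self, pre: int = PRE_BINS, post: int = POST_BINS):
        self.pre_len, self.post_len = pre, post
        self.ring = [0] * pre      # circulating pre-trigger samples
        self.post = []             # samples recorded after the trigger
        self.wr = 0                # write pointer into the ring
        self.triggered = False

    @property
    def complete(self) -> bool:
        return self.triggered and len(self.post) == self.post_len

    def write(self, sample, trigger: bool = False) -> bool:
        """Store one time bin; returns True once the event is complete."""
        if self.complete:
            return True            # frozen, waiting for the read-out
        if not self.triggered:
            self.ring[self.wr] = sample
            self.wr = (self.wr + 1) % self.pre_len
            self.triggered = trigger   # assumption: trigger bin ends the ring
        else:
            self.post.append(sample)
        return self.complete

    def read_out(self):
        """Return the samples in chronological order (oldest first)."""
        return self.ring[self.wr:] + self.ring[:self.wr] + self.post
```

The read-out starts at the write pointer because, once the ring is frozen, the pointer marks the oldest sample. In a double-buffered system a second `FastBuffer` instance would take over the data stream as soon as `write` returns True.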

One of the most critical parts of the trigger/memory system (because it must run at high frequency with a sufficient safety margin) is the switching of the buffers that temporarily store data from the FADCs. Buffers are located in Embedded System Blocks (ESBs), which can be organized only with specific lengths $256 \times k$ ($k = 1, 2, 4, 8, \ldots$). A direct implementation of a 768-word circulating buffer is unfortunately impossible: the compiler automatically reserves 1024 words of memory, and the last 256 words are inaccessible. The ESBs therefore had to be organized more efficiently. The simplest way was to divide each buffer into a circulating ($k = 1$) and a fixed ($k = 2$) part and afterwards to realign the pointers to obtain a chronological read-out. Using circulating buffers only (with part of the memory locked) would require more memory than is available in the chips (Fig. 3).
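The effect of the $256 \times k$ granularity can be illustrated numerically. The sketch below uses hypothetical names and assumes two 32-bit-wide buffers per chip and 1 kbit = 1024 bits.

```python
# Illustrative arithmetic for the 256 x k memory granularity.
# Names are hypothetical; 1 kbit is assumed to be 1024 bits.

RAM_BITS = 48 * 1024          # embedded RAM available in one chip
BLOCK_WORDS = 256             # basic memory length


def allocated_words(requested_words: int) -> int:
    """Words reserved for a memory of length 256 * k with k = 1, 2, 4, 8, ..."""
    k = 1
    while BLOCK_WORDS * k < requested_words:
        k *= 2
    return BLOCK_WORDS * k


def total_bits(words_per_buffer: int, n_buffers: int = 2, width: int = 32) -> int:
    """Bits used by n_buffers memories of the given length and width."""
    return n_buffers * width * words_per_buffer


single = allocated_words(768)                          # one 768-word ring: 1024
split = allocated_words(256) + allocated_words(512)    # ring + fixed part: 768
```

With these numbers a single 768-word ring per buffer would need 64 kbit for two buffers, more than the 48 kbit available, whereas the split organization fits exactly.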

Data from the last event, stored in the first buffer, can be transferred via the DMA, while the new data from the next event may be simultaneously recorded in the adjacent second buffer. Data recording of interesting events, selected by various types of trigger, is independent of DMA data transfer to the main general-purpose Power PC micro-controller, located on the UB (Fig. 4).

![Fig. 2. Global pipeline internal structure of both chips (similarly implemented in the Second Level Trigger in the Fluorescence Detector [9] and in the earlier Engineering Array APEX design [3]). In the current design, routines CD and EF contain processes performed in two clock cycles (speed optimization). External trigger (Ext_Trig) and high-gain channel (ADC[29..0]) are connected to both master and slave chips; the low-gain channel (ADC[59..30]), however, is connected to the slave only. ~EVTCLKF and ~EVTCLKS are the interrupt signals requesting the DMA transfer. The Power PC micro-controller [10] controls the DMA transfer and sends the following signals: ~DMAAF and ~DMAAS (DMA acknowledge for the fast and slow channel, respectively) as well as ~DMADXFER (bus request, common to both channels).]

The algorithm for signal processing at 40 MHz and simultaneous data transfer to the external micro-controller was developed with a sufficient safety margin for the environment of the experiment in the field. This environment subjects the apparatus to large temperature variation, which had to be anticipated during development. This necessitated the design of complex functions in pipeline stages, with precise control of data flow. These functions included comparators, coincidence logic, adders, multiplexers, shift registers, counters, internal memories (implemented as dual-port RAM) and DMA glue logic. The internal EABs were simply programmed as FIFOs; however, dual-port implementation ensures a higher register performance.

The master chip utilizes ~3800 logic cells (LC), the slave ~3000 LC. The chip occupancies are 76% and 60%, respectively. The registered performance is 72 and 80 MHz, respectively, and includes a sufficient safety margin for a wide temperature variation (Fig. 5).

4.1. Synchronization of sub-channels

The trigger from ACEX_A is transferred into ACEX_B to write data synchronously in the low-gain sub-channel (Fig. 1). The master trigger is generated inside the ACEX_A chip and transferred via dedicated Fast Input/Output Registers located in IOC. The DFF in the ACEX_A I/O cell is redundant in order to minimize internal propagation time. The number of input lines connected to the I/O cell is limited to four, equal to the number of inputs in the Look Up Table. More inputs would significantly increase the propagation time.

Within the 25 ns clock cycle, the dedicated register chain improves the safety margin from 9 ns (for a direct connection without the dedicated chain) to 14 ns. That time (for the propagation of the trigger between the pins of the chips) has been confirmed as sufficient in temperature tests covering the range −20 to +70 °C.

4.2. Triggers

Triggers prepared in the high-gain sub-channel are transferred synchronously to the slave chip. The master ACEX chip also controls the enabling and disabling of tri-state buffers during DMA transfers, in both master and slave chips. Some processes are relatively slow; to speed up the global registered performance, adders have been implemented as pipelined routines requiring two clock cycles. The analysis of many variants suggests that the optimization of the design done by the compiler alone is insufficient. In such cases it is recommended to optimize the internal structure manually, interactively with the compiler. If processes related to the same clock cycle are spread throughout the whole chip, the optimization is poor; they should be located as close together as possible to reduce propagation times. Grouping processes in functional blocks (rather than in pipelined, single-clock-dependent routines) significantly reduces the register performance of the system (Fig. 6).

The trigger processing uses two main types of algorithms: single bin and Time over Threshold (ToT) triggers, with additional triggers mainly for diagnostics [4]. The events registered in the Cherenkov tanks result from showers at various distances and with different energies. To record data for a wide class of showers, triggers have to take into account both the wide time dispersion and the amplitude range. The implemented algorithms allow dynamic tuning of the thresholds for various coincidence levels: for the single 10-bit buses corresponding to the ADCs, as well as for the sum of signals from all ADCs. The Time over Threshold algorithm produces a ToT trigger if the number of fired time bins (bins in which the signal is above the fixed threshold) within a sliding time window exceeds a predefined occupancy threshold. Single time-bin sub-triggers drive the 256-bit shift register, whose outputs are connected to hierarchical multiplexers that select the length of the tracking window within the range 0–255. A ToT trigger is generated from multiple sub-triggers spread in time and is therefore delayed in comparison to the single bin trigger. The ToT is finally generated in the K_bin_Fast subroutine and has to be merged with the other triggers in an “earlier” G_bin_Fast subroutine.
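The Time over Threshold condition can be sketched in software as a running count over a sliding window. This is a simplified single-channel model, not the AHDL routine: the function name and parameters are hypothetical, the coincidence between PMTs is not modelled, and "over the occupancy threshold" is assumed to mean strictly greater than.

```python
from collections import deque


def tot_trigger(samples, threshold, window, occupancy):
    """Return one boolean per time bin: True where the ToT condition holds.

    A bin is "fired" when its sample is above `threshold`. The trigger is
    set when more than `occupancy` of the last `window` bins are fired.
    """
    fired = deque(maxlen=window)   # plays the role of the shift register
    count = 0                      # number of fired bins inside the window
    out = []
    for s in samples:
        if len(fired) == window:
            count -= fired[0]      # the oldest bin is about to leave the window
        bit = 1 if s > threshold else 0
        fired.append(bit)
        count += bit
        out.append(count > occupancy)
    return out
```

Keeping a running count avoids summing the whole window in every time bin. For example, with `threshold=3`, `window=4` and `occupancy=2`, the trigger is set only while at least three of the last four bins are above threshold.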

4.3. Internal memories and automatization of buffer switching

The fast channel is double-buffered; when a trigger occurs, the shower profile is recorded in the first buffer and the buffers are automatically switched to monitor the subsequent data. The full buffer waits for the DMA transfer. Such a double buffered approach reduces the dead time of the system. The automatic buffer switching assures full speed processing and reduces the risk related to manual flag manipulations, responsible for buffer selection. The set of flags, indicating the type of the trigger, is available automatically for the last registered event. The Station Controller does not have to check any control bits to select the appropriate buffer and flags set.

![Fig. 6. Data from both chips are transferred in non-interlaced mode; first all data from ACEX_A, second from ACEX_B. Shorter MUX chain (in comparison with APEX design) requires only 4 pulses for data shift (data to be available for the DMA transfer).](image)

The goal of the first sub-buffer is to permanently monitor incoming data and freeze them when a trigger appears. The second buffer then receives data after a trigger and allows the completion of a full profile. Pointers in ring buffers are automatically adjusted to assure data transfer in a chronological order. If the next trigger appears just after the DMA, when the adjacent buffer is still full and data are waiting for a transfer, the time needed to operate a circulating buffer may be too short. A part of data may correspond to the previous event. To eliminate this data integrity violation, triggers are disabled for 6.4 µs after DMA, but only if the adjacent buffer is not empty (see Fig. 7).
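The 6.4 µs hold-off appears to correspond to the time needed to refill the 256-word circulating buffer at 40 MHz. The sketch below checks this and also reproduces the other intervals quoted in the caption of Fig. 7, under the assumption (not stated in the text) that a DMA word takes one clock cycle in 0 WS mode and two in 1 WS mode. All names are hypothetical.

```python
# Illustrative timing arithmetic; constant and function names are hypothetical.

CLOCK_HZ = 40_000_000          # sampling clock (25 ns per time bin)
PRE_BINS = 256                 # length of the circulating buffer
EVENT_BINS = 768               # 256 + 512 time bins per chip and event
N_CHIPS = 2                    # ACEX_A and ACEX_B are read out one after the other


def duration_us(n_words: int, cycles_per_word: int = 1) -> float:
    """Time in microseconds for n_words at cycles_per_word clocks each."""
    return n_words * cycles_per_word * 1e6 / CLOCK_HZ


refill_us = duration_us(PRE_BINS)                    # 6.4 us
event_us = duration_us(EVENT_BINS)                   # 19.2 us
dma_0ws_us = duration_us(N_CHIPS * EVENT_BINS, 1)    # 38.4 us (assumed 1 clock/word)
dma_1ws_us = duration_us(N_CHIPS * EVENT_BINS, 2)    # 76.8 us (assumed 2 clocks/word)
```

Under these assumptions the hold-off after a trigger equals the length of one event, and the minimum spacing to the third trigger equals the time needed to read out both chips.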

4.4. Read-out systems

Classical microprocessor based tri-state gates, connected to the common internal data bus, produced glitches due to overloading and were replaced by pipelined multiplexers (see Fig. 6).

The interlaced mode for DMA transfer, used in the previous APEX design, required 5 empty reads to prepare internal data for the transfer through the pipelined multiplexers. In the ACEX design, the configuration of the two chips requires a more natural, non-interlaced DMA mode. Data from the high-gain channel (ACEX_A) are transferred in the first 768-word block. The subsequent data from the low-gain channel (ACEX_B) are transferred in the second 768-word block. Such an approach needs two cycles of DMA initialization by the Power PC. Splitting of sub-channels reduces the length of the pipeline read-out system; only 4 empty reads are needed (Fig. 8).

After the interrupt, the Station Controller needs to send only a single command in order to shift memory data for the DMA transfer. Memory data are not shifted automatically. I/O registers merge the DMA and the configuration channels. After the single write preceding the DMA, data from the registers must not be read, because the first words in the memory data chain would be destroyed.


**Fig. 7.** When a trigger occurs, subsequent triggers are disabled for 19.2 µs to fully separate consecutive events. To avoid overlapping the AF/BC or BF/AC buffers, the third trigger should occur no earlier than 76.8 µs (38.4 µs in 0 WS mode) after the second one. Triggers are disabled for 6.4 µs after the DMA is completed, in order to allow filling of the AC/BC circulating buffers; otherwise data integrity may be violated.

The DMA algorithm also takes into account the refreshing of the dynamic memories in the Station Controller, which repeatedly interrupts the DMA transfer. This works reliably at full speed (40 MHz) without any additional wait-states over the full temperature range [11] (Fig. 9).

4.5. ~DMADXFER vs. ~CAS3

The Engineering Array design was operated only in the 1 wait-state (WS) mode. The reduction of the wait state from 1 to 0 WS mode speeds up data transfer and reduces the dead-time by a factor of 2. The EA daughter Front End Boards (see Fig. 2 in [4]) did not support the ~DMADXFER line, due to a lack of free pins in the PGA132 socket. The line ~DMADXFER, available in the ACEX design, allowed the implementation of the...
algorithm supporting both 0 and 1 WS changing, even on the fly.

The propagation times of control lines have been additionally reduced by using dedicated inputs instead of common I/O pins. ACEX® chips contain four dedicated lines, which were used for ~DMADXFER, ~DMAAF and dynamic memory control lines ~CAS3 and ~DRAMWE [10]. The ~DMAAS for the slow channel was not a critical line and has been connected to the standard I/O (Fig. 10).

The 0 WS mode, although possible, has not been used. The 1 WS provides a greater safety margin, however only for the hold time. The internal PLL clock in the ACEX® chips unfortunately does not support precise shift tuning.

5. The slow channel

The slow channel was implemented in the EA phase [4], but due to assembly problems it was not activated. In the current design, some modifications have been introduced to improve the flexibility of data management. Data recorded in the slow channel are zero-suppressed, collected in 20-word packets, prefaced by time stamps in each packet, and stored in the 14-bit addressed external memory. A continuous memory area is divided into two slow buffers, which are switched between independent writing and DMA reading processes via the left and right ports, respectively. Buffer switching, flag setting (indicating the type of trigger), the management of internal data and the control of the external memory are fully autonomous PLD tasks. The slow channel requires only a simple 2-word shift for initialization (Fig. 11).

5.1. Triggers in the slow channel

The structure of the trigger logic is the same as the single bin trigger in the fast channel. In the slow mode, when the trigger appears, the threshold can decrease. Recording data with a lower threshold allows the investigation of more detailed structure in weak signals, normally cut by the high threshold.

In the slow channel the length of the recorded data depends on their values above the thresholds and cannot be anticipated. The slow buffer is relatively long: 8192 words of 32 bits. If one of the buffers is full, ~EVTCLKS is sent to the Local Station as an interrupt, requesting a transfer of data from the external memory. However, the priority of this transfer is lower than for the fast channel, and the slow transfer may be interrupted by the fast DMA. Data may be written to the memory via the left port independently of, and simultaneously with, the reading of data via the right port. Of the available 36 bits, only 32 are used: 30 for high-gain ADC data and two as indicators of threshold and time stamps. To synchronize data with the fast channel, time stamps are inserted into the same stream of data and are written together with real data in the external memory. Time stamps are inserted at the beginning of real data blocks (unless skipped to avoid violation of data integrity) and are indicated by the 31st bit, treated as a flag [7].

![Fig. 11. The simulation showing the propagation of the trigger (#3FF) through the pipeline stages and an insertion of the time stamp (#1FD) four time bins before the trigger.](image-url)
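The 32-bit word format described above can be illustrated with a small packing routine. The exact bit layout is not given in the text, so the positions used below are assumptions: three 10-bit ADC values in bits 0 to 29, the time stamp flag in bit 30 (the "31st bit" when counting from one) and the threshold indicator in bit 31. All names are hypothetical.

```python
# Hypothetical layout of one 32-bit slow-channel word (bit positions assumed).

ADC_BITS = 10
ADC_MASK = (1 << ADC_BITS) - 1
TIME_STAMP_FLAG = 1 << 30      # "31st bit" counted from one (assumption)
THRESHOLD_FLAG = 1 << 31       # assumed position of the threshold indicator


def pack_word(adc, time_stamp=False, above_threshold=False):
    """Pack three 10-bit values and two flag bits into one 32-bit integer."""
    if len(adc) != 3 or any(not 0 <= v <= ADC_MASK for v in adc):
        raise ValueError("expected three values in the range 0..1023")
    word = adc[0] | (adc[1] << ADC_BITS) | (adc[2] << (2 * ADC_BITS))
    if time_stamp:
        word |= TIME_STAMP_FLAG
    if above_threshold:
        word |= THRESHOLD_FLAG
    return word


def unpack_word(word):
    """Inverse of pack_word: returns (adc_values, time_stamp, above_threshold)."""
    adc = tuple((word >> (i * ADC_BITS)) & ADC_MASK for i in range(3))
    return adc, bool(word & TIME_STAMP_FLAG), bool(word & THRESHOLD_FLAG)
```

A reader of the external memory would test the time stamp flag first and interpret the lower 30 bits either as a time stamp or as ADC data accordingly.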

6. External memory control

The IDT70V3569 is a high-speed 16K × 36-bit synchronous Dual-Port RAM. The memory array utilizes Dual-Port memory cells to allow simultaneous access to any address from both ports. Registers on control, data, and address inputs provide minimal setup and hold times. The timing latitude provided by this approach allows systems to be designed with very short cycle times. With an input data register, the IDT70V3569 has been optimized for applications having unidirectional or bidirectional data flow in bursts. The memory is operated in the address counter advance mode. The internal address is incremented (∼CNTEN = LOW [12]) on the rising edge of the controlled clock, provided by the PLD, regardless of all other memory control signals. That mode has been selected because too few PLD pins are available for direct control of the external memory. An additional advantage is lower digital noise on the board. The slow data flow is one-way. The memory utilizes the global ADC clock for writing. For reading, however, the right port clock is driven by enabled clock-like pulses generated by the PLD. A direct global clock connection would violate the timing of the DMA and had to be replaced by additional PLD circuitry.

7. Power consumption

The surface detector is supplied by solar panels, so the power budget is one of the most critical parameters of the system. The total power consumption of the trigger should not exceed 1 W. Measurements confirmed the estimates given by the ACEX Power Calculator available on the Altera web site. The power consumption of the ACEX design is 7–10% lower (with some dependence on the trigger rate) than that of the APEX design and satisfies the specified limit of 1 W (Fig. 12).

8. Optimization

Conclusions drawn from the preliminary temperature and high-frequency tests [11] allowed the AHDL code to be optimized. ∼CAS3 has been replaced by ∼DMADXFER. The structure of the pipeline routines became more hierarchical in order to encapsulate the resources used and decrease propagation times. The Design Assistant attached to the Quartus® compiler (included since the version 3.0) allowed the correction of many paths not optimized earlier, significantly increasing register performance.

The next generation of the First Level Surface Detector Trigger using the Cyclone™ chips is also being developed at this time. Optimized solutions, recommended by the compiler for the Cyclone design, have also been incorporated into the ACEX design [13,14].

The FEB has been successfully integrated with the UB (manufactured by College de France). The internal pattern generator has tested both fast and slow channels. The patterns for the 128 “fast” DMA transfers ($128 \times 1536$ words of 32 bits = 768 kB) and for the 32 “slow” DMA transfers ($32 \times 8192$ words of 32 bits = 1 MB) were run in long-term loops in the Wuppertal climate chamber (over the full temperature range of $-20$ to $+70$ °C) for over 5 days. All data (a total of 47 GB) were transferred with zero wait-states without any bit errors (an error rate of less than $3 \times 10^{-12}$). The temperature-dependent deviation of timing observed in the preliminary tests [11] was fully compensated by the code optimization. Even under extreme environmental conditions, the ACEX design performs very stably.
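The data volumes and the error-rate limit quoted for the climate chamber test can be verified with simple arithmetic. In the sketch below the names are hypothetical, kB, MB and GB are assumed to be binary units, and the bound is the naive estimate of one error in all transferred bits.

```python
# Illustrative check of the test-pattern sizes and the error-rate bound.
# Names are hypothetical; kB/MB/GB are assumed to be binary units.

BYTES_PER_WORD = 4             # 32-bit words


def pattern_bytes(n_transfers: int, words_per_transfer: int) -> int:
    """Size in bytes of a pattern made of n_transfers DMA transfers."""
    return n_transfers * words_per_transfer * BYTES_PER_WORD


def error_rate_bound(total_bytes: float) -> float:
    """Naive upper bound 1/N on the bit error rate when N bits show no error."""
    return 1.0 / (total_bytes * 8)


fast_bytes = pattern_bytes(128, 1536)     # 786432 bytes = 768 * 1024
slow_bytes = pattern_bytes(32, 8192)      # 1048576 bytes = 1024 * 1024
bound = error_rate_bound(47 * 2**30)      # about 2.5e-12
```

The bound stays below $3 \times 10^{-12}$ whether 47 GB is read as binary or as decimal gigabytes.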

9. Conclusions

The new ACEX family allowed significant reductions in the total cost of the trigger/memory circuitry as well as the power consumption. The automatization of the read-out system and the DMA transfer minimized the dead time. The functionality of the code as well as the stability and reliability of the inter-chip synchronization have been fully confirmed in the prototype implementation. The ACEX design described here has been approved as the framework for the further development and the practical implementation in all Cherenkov detectors [15]. The further optimization of the code, utilizing hints from the development of the next generation of the trigger based on the Cyclone family, allowed significant improvement in the registered performance and an increase in the safety margin of the entire system.

Acknowledgements

This work was supported by the US Department of Energy, the French Centre National de la Recherche Scientifique and the Polish Committee of Science. The author would like to thank Michigan Technological University for access to the laboratory where the design was developed in January–June 2002, the Polish Committee of Science for additional financial support (Grant no. 2 PO3D 011 24), and College de France, where the AHDL codes were optimized in July 2002–March 2003 during the integration of the FEB with the UB.

References