Works

Various results and achievements from Sony's technological development.

[Academic Conference]The Society for Information Display(SID)
[Theme]Latency Compensation for Optical See-Through Head-Mounted with Scanned Display

The authors present the design and implementation of latency compensation techniques for an optical see-through head-mounted raster-scan display to realize augmented reality. The maximum registration error of 3D virtual objects is 0.03 degrees on the horizontal axis when the rolling motion of the user’s head is simulated.
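
To give a feel for the scale involved (an illustrative sketch, not the authors' implementation): in a raster-scan display each scanline is emitted at a slightly different time, so a compensator can predict the head pose per scanline from the measured angular velocity. A minimal constant-velocity sketch; all names and numbers here are assumptions:

```python
import numpy as np

def scanline_shift_deg(omega_deg_s, base_latency_s, line_period_s, n_lines):
    """Predicted angular correction per scanline for a raster-scan display.

    Assumes constant head angular velocity over the latency window
    (a simplification; real predictors may also use angular acceleration).
    """
    line_idx = np.arange(n_lines)
    # Each scanline is emitted a little later than the previous one,
    # so its compensation uses a slightly larger effective latency.
    latency = base_latency_s + line_idx * line_period_s
    return omega_deg_s * latency

# Example: 100 deg/s head rotation, 10 ms base latency, 1080-line frame at 60 Hz.
shifts = scanline_shift_deg(100.0, 0.010, (1 / 60) / 1080, 1080)
print(shifts[0], shifts[-1])  # ~1.0 deg at the first line, larger at the last
```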

Category Display & Visual
Name H. Aga,
A. Ishihara,
K. Kawasaki,
M. Nishibe,
S. Kohara,
T. Ohara,
M. Fukuchi

[Academic Conference]International Conference on Nitride Semiconductor (ICNS)
[Theme]Watt-class 462 nm-Blue and 530 nm-Green Laser Diodes

In this research, watt-class green and blue laser diodes, fabricated on free-standing semipolar GaN and conventional c-plane GaN substrates, respectively, are developed. Although several research groups have recently developed green laser diodes on semipolar GaN substrates, which have weaker piezoelectric fields and higher indium homogeneity in InGaN active regions than c-plane GaN, watt-level output power had yet to be achieved. By utilizing this semipolar plane, the first watt-class green lasers at 530 nm are successfully fabricated, achieving maximum output powers in excess of 2 W, which to the best of our knowledge is the highest value reported for any GaN-based green laser diode. A wall-plug efficiency of 17.5% is realized at a current of 1.2 A under continuous-wave operation, which corresponds to an optical output of approximately 1 W and is the highest value reported to date. In addition, high-power and high-efficiency blue laser diodes at 465 nm are successfully fabricated on conventional c-plane GaN substrates. The output power and wall-plug efficiency are 5.2 W and 37.0%, respectively, at a current of 3.0 A under continuous-wave operation. These laser diodes are promising light sources meeting the ITU-R Recommendation BT.2020 for future laser display applications.
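
Wall-plug efficiency is optical output power divided by electrical input power, WPE = P_opt / (I · V). The operating voltages are not stated in the abstract, but they can be back-estimated from the reported figures; a small sketch (the derived voltages are estimates, not reported values):

```python
def wall_plug_efficiency(p_opt_w, current_a, voltage_v):
    """WPE = optical output power / electrical input power."""
    return p_opt_w / (current_a * voltage_v)

# Back-calculate operating voltage from the reported figures: V = P / (I * WPE)
# blue LD: 5.2 W output, 37.0% WPE at 3.0 A
v_blue = 5.2 / (3.0 * 0.370)    # ~4.7 V (estimate, not a reported value)
# green LD: ~1 W output, 17.5% WPE at 1.2 A
v_green = 1.0 / (1.2 * 0.175)   # ~4.8 V (estimate)
print(f"blue ~{v_blue:.1f} V, green ~{v_green:.1f} V")
```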

Category Display & Visual
Name M. Murayama,
Y. Nakayama,
K. Yamazaki,
Y. Hoshina,
H. Watanabe(Sony Corporation)

[Academic Conference]The Society for Information Display(SID)
[Theme]New Pixel Driving Circuit Using Self-discharging Compensation Method for High-Resolution OLED Micro Displays on a Silicon Backplane

A new 4T2C pixel circuit formed on a silicon substrate is proposed to realize a high-resolution AMOLED microdisplay with a 7.8-μm pixel pitch. To achieve high luminance uniformity, the pixel circuit internally compensates the Vth variation of the driving MOSFET using a self-discharging method. Also presented are 0.5-in Quad-VGA and 1.25-in wide Quad-XGA microdisplays with the proposed pixel circuit.
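
As a quick consistency check on the numbers (assuming Quad-VGA means 1280 × 960 pixels, which is an assumption rather than something stated in the abstract), the panel diagonal follows directly from the pixel pitch:

```python
import math

def diagonal_inches(pitch_um, cols, rows):
    """Panel diagonal implied by pixel pitch and resolution."""
    diag_um = pitch_um * math.hypot(cols, rows)
    return diag_um / 25_400  # micrometres per inch

# Assuming Quad-VGA = 1280 x 960 (four VGA frames) -- an assumption.
print(diagonal_inches(7.8, 1280, 960))  # ~0.49 in, consistent with "0.5-in"
```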

Category Display & Visual
Name K. Kimura,
Y. Onoyama,
T. Tanaka(Sony Corporation),
N. Toyomura,
H. Kitagawa(Sony Semiconductor Solutions Corporation)

[Academic Conference]The Society for Information Display(SID)
[Theme]High light extraction efficiency laser-phosphor light source

We investigated a laser-phosphor light source using an inorganic phosphor wheel. We experimentally confirmed that the light extraction efficiency of the inorganic phosphor wheel is 8% higher than that of a conventional phosphor wheel. In addition, we explain the cause of this efficiency improvement using a fluorescence emission model.

Category Display & Visual
Name H. Morita,
Y. Maeda,
I. Kobayashi,
Y. Sato,
T. Nomura,
H. Kikuchi(Sony Corporation)

[Academic Conference]Biomedical Engineering Systems and Technologies(BIOSTEC)
[Theme]Wearable Motion Tolerant PPG Sensor for Instant Heart Rate in Daily Activity

A wristband-type PPG heart rate sensor capable of overcoming motion artifacts in daily activity and detecting heart rate variability has been developed together with a motion artifact cancellation framework. In this work, a motion artifact model in daily life was derived and motion artifacts caused by activity of arm, finger, and wrist were cancelled significantly. Highly reliable instant heart rate detection with high noise-resistance was achieved from noise-reduced pulse signals based on peak-detection and autocorrelation methods. The wristband-type PPG heart rate sensor with our motion artifact cancellation framework was compared with ECG instant heart rate measurement in both laboratory and office environments. In a laboratory environment, mean reliability (percentage of time within 10% error relative to ECG instant heart rate) was 86.5% and the one-day pulse-accuracy achievement rate based on time use data of body motions in daily life was 88.1% or approximately 21 hours. Our device and motion artifact cancellation framework enable continuous heart rate variability monitoring in daily life and could be applied to heart rate variability analysis and emotion recognition.
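
A common building block for this kind of motion-artifact cancellation (a standard technique, not necessarily the authors' exact framework) is an adaptive filter that uses an accelerometer signal as the noise reference; a minimal NLMS sketch:

```python
import numpy as np

def nlms_cancel(ppg, acc, n_taps=16, mu=0.5, eps=1e-8):
    """Subtract the accelerometer-correlated component from a PPG signal.

    ppg: raw PPG samples; acc: motion reference (same length).
    Returns the error signal, i.e. the motion-reduced pulse signal.
    """
    w = np.zeros(n_taps)
    out = np.zeros(len(ppg))
    for n in range(n_taps, len(ppg)):
        x = acc[n - n_taps:n][::-1]          # reference vector
        y = w @ x                            # estimated motion artifact
        e = ppg[n] - y                       # cleaned sample
        w += mu * e * x / (x @ x + eps)      # NLMS weight update
        out[n] = e
    return out

# Synthetic demo: pulse + motion leakage, with the motion also seen by the accelerometer.
t = np.arange(0, 10, 0.01)
pulse = np.sin(2 * np.pi * 1.2 * t)          # ~72 bpm
motion = np.sin(2 * np.pi * 2.5 * t)         # arm swing
cleaned = nlms_cancel(pulse + 0.8 * motion, motion)
```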

Category Medical & Life Science
Name T. Ishikawa,
Y. Hyodo,
K. Miyashita,
K. Yoshifuji,
Y. Imai(Sony Corporation)

[Academic Conference]The Society for Information Display(SID)
[Theme]A Plastic Holographic Waveguide Combiner for Light-weight and Highly-transparent Augmented Reality Glasses

There is high demand for light-weight, stylishly designed augmented reality (AR) glasses with natural see-through capabilities for the wide-spread distribution of novel wearable devices to general consumers. We have successfully developed a unique production process for a holographic waveguide combiner that enables us to laminate holographic optical elements (HOEs) onto a plastic substrate with optical-grade quality. The plastic-substrate waveguide combiner has a number of advantages over conventional glass-substrate combiners; the plastic substrate makes AR glasses lighter in weight and unbreakable. With the lamination process of HOEs, we can apply them to various designs to satisfy general customers’ wide range of style preferences. We have also potentially made it possible for the holographic waveguide combiner to be produced in larger volumes at lower costs by using our novel roll-to-roll hologram recording and laminating process. In this paper, we present our approach to the plastic-substrate HOE production process for AR glasses.

Category Display & Visual
Name T. Yoshida,
K. Tokuyama,
Y. Takai,
D. Tsukuda,
T. Kaneko,
N. Suzuki(Sony Corporation),
T. Anzai(Sony Global Manufacturing & Operations Corporation),
A. Yoshikaie,
K. Akutsu,
A. Machida(Sony Corporation)

[Academic Conference]The Society for Information Display(SID)
[Theme]A Plastic Electrochromic Dimming Device for Augmented Reality Glasses

We have developed an electrochromic dimming device on a plastic substrate with high transparency modulation from 70% to 10% and a bending radius below 30 mm. It withstands more than 10,000 switching cycles and high-temperature, high-humidity conditions. Combining the device with AR glasses enables clear image visibility in various environments.

Category Display & Visual
Name A. Machida,
K. Kadono,
Y. Ishii,
T. Kono,
H. Takanashi,
A. Nishiike(Sony Corporation),
H. Suzuki,
Y. Nakagawa,
K. Ando,
D. Kasahara,
A. Takeda(Sony Global Manufacturing & Operations Corporation),
K. Nomoto(Sony Corporation)

[Academic Conference]International Solid-State Circuits Conference (ISSCC)
[Theme]Projection and sensing technology of Xperia Touch

Category Display & Visual
Name K. Kaneda(Sony Corporation)

[Academic Conference]Scientific Reports, 8, 10350
[Theme]Lateral optical confinement of GaN-based VCSEL using an atomically smooth monolithic curved mirror

We demonstrate the lateral optical confinement of GaN-based vertical-cavity surface-emitting lasers (GaN-VCSELs) with a cavity containing a curved mirror that is formed monolithically on a GaN wafer. The output wavelength of the devices is 441–455 nm. The threshold current is 40 mA (Jth = 141 kA/cm2) under pulsed current injection (Wp = 100 ns; duty = 0.2%) at room temperature. We confirm the lateral optical confinement by recording near-field images and investigating the dependence of threshold current on aperture size. The beam profile can be fitted with a Gaussian having a theoretical standard deviation of σ = 0.723 µm, which is significantly smaller than previously reported values for GaN-VCSELs with plane mirrors. Lateral optical confinement with this structure theoretically allows aperture miniaturization to the diffraction limit, resulting in threshold currents far below one milliampere. The proposed structure enabled GaN-based VCSELs to be constructed with cavities as long as 28.3 µm, which greatly simplifies the fabrication process owing to longitudinal mode spacings of less than a few nanometers and should help the implementation of these devices in practice.
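
The long-cavity claim can be sanity-checked with the standard Fabry-Perot mode-spacing formula, Δλ = λ² / (2·n·L). Taking n ≈ 2.45 as a representative refractive index for GaN near 450 nm (an assumed value; the abstract does not quote the group index):

```python
def mode_spacing_nm(wavelength_nm, index, cavity_um):
    """Longitudinal mode spacing of a Fabry-Perot cavity: dL = lambda^2 / (2 n L)."""
    lam_m = wavelength_nm * 1e-9
    return lam_m**2 / (2 * index * cavity_um * 1e-6) * 1e9

# n ~= 2.45 for GaN near 450 nm is an assumed representative value.
print(mode_spacing_nm(450, 2.45, 28.3))  # ~1.5 nm -> "less than a few nanometers"
```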

Category Display & Visual
Name T. Hamaguchi,
M. Tanaka,
J. Mitomo,
H. Nakajima,
M. Ito,
N. Kobayashi,
K. Fujii,
H. Watanabe,
S. Satou,
M. Ohara,
R. Koda,
H. Narui(Sony Corporation)

[Academic Conference]IEEE Engineering in Medicine and Biology Conference(EMBC)
[Theme]Feature quantities of EEG to characterize human internal states of concentration and relaxation

Category Medical & Life Science
Name N. Sazuka,
Y. Komoriya,
T. Ezaki(Sony Corporation),
M. Uraguchi,
H. Ohira(Nagoya University)

[Academic Conference]The Electrochemical Society(AiMES)
[Theme]Atomic Diffusion Bonding for Optical Devices with High Optical Density

An inorganic bonding method providing 100% light transmittance at the bonded interface was proposed for fabricating devices with high optical density. First, we fabricated 5000 nm-thick SiO2 oxide underlayers on synthetic quartz glass wafers. After the film surfaces were polished to reduce surface roughness, the wafers with oxide underlayers were bonded using thin Ti films in vacuum at room temperature as a usual atomic diffusion process. After post-annealing at 300 °C, 100% light transmittance at the bonded interface, with a surface free energy at the bonded interface greater than 2 J/m2, was achieved. Dissociated oxygen from the oxide layers probably enhanced Ti film oxidation, resulting in high light transmittance with high bonding strength attributable to the annealing. Using this bonding process, we fabricated a polarizing beam splitter and demonstrated that this bonding process is useful for fabricating devices with high optical density.

Category Device & Material
Name G. Yonezawa,
Y. Takahashi,
Y. Sato,
S. Abe,
M. Uomoto(Sony Corporation),
T. Shimatsu(Tohoku University)

[Academic Conference]Applied Physics Letters, 113, 163302
[Theme]Impact of molecular orientation on energy level alignment at C60/pentacene interfaces

The molecular orientation and the electronic structure at molecular donor/acceptor interfaces play an important role in the performance of organic optoelectronic devices. Here, we show that graphene substrates can be used as templates for tuning the molecular orientation of pentacene (PEN), selectively driving the formation of either face-on or edge-on arrangements by controlling the temperature of the substrate during deposition. The electronic structure and morphology of the two resulting C60/PEN heterointerfaces were elucidated using ultraviolet photoelectron spectroscopy and atomic force microscopy, respectively. While the C60/PEN (edge-on) interface exhibited a vacuum level alignment, the C60/PEN (face-on) interface exhibited a vacuum level shift of 0.2 eV, which was attributed to the formation of an interface dipole that resulted from polarization at the C60/PEN boundary.

Category Material Analysis & Simulation
Name T. Nishi,
M. Kanno,
M. Kuribayashi,
Y. Nishida,
S. Hattori,
H. Kobayashi(Sony Corporation),
F. von Wrochem,
V. Rodin,
G. Nelles(Sony Europe Limited, Materials Science Laboratory),
S. Tomiya(Sony Corporation)

[Academic Conference]IEEE Vehicular Technology Conference(VTC)
[Theme]GFDM with Different Subcarrier Bandwidths

This paper proposes a generalized frequency division multiplexing (GFDM) modulation scheme that transmits a signal with different subcarrier bandwidths. In a receiver, the GFDM signal is demodulated using a zero-forcing (ZF) algorithm or a minimum mean square error (MMSE) algorithm, and the BER performance of these algorithms is related to the condition number of the modulation matrix. This matrix can be optimized by adjusting the roll-off factor of the subcarrier filters. It is shown that the performance of the proposed GFDM is about 0.02 dB better than that with a roll-off factor of 0 at a BER of 10⁻³ on an AWGN channel. On multipath fading channels, on the other hand, the BER performance improves as the subcarrier bandwidth increases because of frequency diversity.
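
For illustration, a generic critically sampled GFDM construction (the standard formulation, not the paper's multi-bandwidth variant): the modulation matrix collects circularly time- and frequency-shifted copies of a raised-cosine prototype, and its condition number, which the paper ties to ZF/MMSE BER performance, can be computed directly:

```python
import numpy as np

def raised_cosine(t, alpha):
    """Raised-cosine pulse; alpha is the roll-off factor."""
    denom = 1.0 - (2.0 * alpha * t) ** 2
    safe = np.where(np.abs(denom) < 1e-10, 1.0, denom)
    h = np.sinc(t) * np.cos(np.pi * alpha * t) / safe
    if alpha > 0:
        # limit value at the singular points t = +-1/(2 alpha)
        h = np.where(np.abs(denom) < 1e-10,
                     (np.pi / 4) * np.sinc(1.0 / (2.0 * alpha)), h)
    return h

def gfdm_matrix(K, M, alpha):
    """N x N GFDM modulation matrix, N = K*M (K subcarriers, M subsymbols)."""
    N = K * M
    t = (np.arange(N) - N / 2) / K            # prototype time axis in symbol units
    g = raised_cosine(t, alpha)
    g /= np.linalg.norm(g)
    A = np.empty((N, N), dtype=complex)
    n = np.arange(N)
    for m in range(M):
        gm = np.roll(g, m * K)                # circular time shift by one subsymbol
        for k in range(K):
            A[:, m * K + k] = gm * np.exp(2j * np.pi * k * n / K)
    return A

# Condition number of the modulation matrix vs roll-off (smaller is better for ZF/MMSE).
for alpha in (0.0, 0.1, 0.3, 0.5):
    print(alpha, np.linalg.cond(gfdm_matrix(K=8, M=5, alpha=alpha)))
```

ZF demodulation is then `np.linalg.solve(A, y)`; the smaller the condition number of A, the less noise enhancement ZF/MMSE incur, which is the quantity the paper optimizes via the roll-off factor.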

Category Communication
Name Y. Akai,
Y. Enjoji,
Y. Sanada(Keio University),
R. Kimura,
R. Sawai(Sony Corporation)

[Academic Conference]IEEE Vehicular Technology Conference(VTC)
[Theme]A Singularity-free GFDM Modulation scheme with Parametric Shaping Filter Sampling

A GFDM modulation scheme that circumvents the singularity issue of the GFDM transformation matrix is presented. The coefficients used for the pulse-shaping filter are derived from the prototype filter depending on the parity of the subsymbols. The proposed pulse-shaping filter design makes it possible to have a non-singular transformation matrix for an arbitrary number of subsymbols and/or subcarriers in sparse frequency-domain GFDM modulation.

Category Communication
Name A. Yoshizawa,
R. Kimura,
R. Sawai(Sony Corporation)

[Academic Conference]European Conference on Computer Vision(ECCV)
[Theme]Scene depth profiling using Helmholtz Stereopsis

Helmholtz stereopsis is a 3D reconstruction technique, capturing surface depth independent of the reflection properties of the material by using Helmholtz reciprocity. In this paper we are interested in studying the applicability of Helmholtz stereopsis for surface and depth profiling of objects and general scenes in the context of perspective stereo imaging. Helmholtz stereopsis captures a pair of reciprocal images by exchanging the position of light source and camera. The resulting image pair relates the image intensities and scene depth profile by a partial differential equation. The solution of this differential equation depends on the boundary conditions provided by the scene. We propose to limit the illumination angle of the light source, such that only mutually visible parts are imaged, resulting in stable boundary conditions. By simulation and experiment we show that a unique depth profile can be recovered for a large class of scenes including multiple occluding objects.
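
The underlying constraint (classical Helmholtz stereopsis in the form introduced by Zickler et al., on which this work builds) is BRDF-independent: for a reciprocal image pair, the vector i₁·v̂₁/d₁² − i₂·v̂₂/d₂² must be orthogonal to the surface normal. A sketch of the per-point residual that a depth search would drive to zero:

```python
import numpy as np

def helmholtz_residual(i_1, i_2, p, n_hat, o_cam, o_light):
    """Reciprocity residual of Helmholtz stereopsis (Zickler et al. form).

    Image 1: camera at o_cam, light at o_light, intensity i_1 at surface point p.
    Image 2: positions exchanged, intensity i_2.
    For the correct p and unit normal n_hat the residual vanishes for any BRDF:
        (i_1 * v1_hat / d1**2 - i_2 * v2_hat / d2**2) . n_hat == 0,
    with v1_hat, v2_hat the unit vectors from p toward o_cam and o_light.
    """
    v1 = o_cam - p
    v2 = o_light - p
    d1, d2 = np.linalg.norm(v1), np.linalg.norm(v2)
    w = i_1 * (v1 / d1) / d1**2 - i_2 * (v2 / d2) / d2**2
    return w @ n_hat

# Toy evaluation; a depth search would scan p along a ray and keep the depth
# whose residual (combined with a normal estimate) is closest to zero.
p = np.array([0.0, 0.0, 0.0])
n_hat = np.array([0.0, 0.0, 1.0])
print(helmholtz_residual(0.8, 0.5, p, n_hat,
                         o_cam=np.array([1.0, 0.0, 2.0]),
                         o_light=np.array([-1.0, 0.0, 2.0])))
```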

Category Computer Vision & CG
Name H. Mori,
R. Koehle,
M. Kamm(Sony Europe Limited)

[Academic Conference]International Speech Communication Association(Interspeech)
[Theme]Automatic Pronunciation Generation by Utilizing a Semi-Supervised Deep Neural Network

Phonemic or phonetic sub-word units are the most commonly used atomic elements to represent speech signals in modern ASR systems. However, they are not the optimal choice for several reasons, such as the large amount of effort required to handcraft a pronunciation dictionary, pronunciation variations, human mistakes, and under-resourced dialects and languages. Here, we propose a data-driven pronunciation estimation and acoustic modeling method which takes only the orthographic transcription to jointly estimate a set of sub-word units and a reliable dictionary. Experimental results show that the proposed method, which is based on semi-supervised training of a deep neural network, largely outperforms phoneme-based continuous speech recognition on the TIMIT dataset.

Category AI & Machine Learning
Name N. Takahashi(Sony Corporation),
T. Naghibi,
B. Pfister,
L. V. Gool(ETH Zurich)

[Academic Conference]International Speech Communication Association(Interspeech)
[Theme]Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Detection

We propose a novel method for Acoustic Event Detection (AED). In contrast to speech, sounds coming from acoustic events may be produced by a wide variety of sources. Furthermore, distinguishing them often requires analyzing an extended time period due to the lack of a clear sub-word unit. In order to incorporate long-time frequency structure for AED, we introduce a convolutional neural network (CNN) with a large input field. In contrast to previous works, this enables training audio event detection end-to-end. Our architecture is inspired by the success of VGGNet and uses small, 3×3 convolutions, but with more depth than previous methods in AED. In order to prevent over-fitting and to take full advantage of the modeling capabilities of our network, we further propose a novel data augmentation method to introduce data variation. Experimental results show that our CNN significantly outperforms state-of-the-art methods including Bag of Audio Words (BoAW) and classical CNNs, achieving a 16% absolute improvement.
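
A plausible form of such augmentation (illustrative only; the paper's exact scheme may differ) is to synthesize new examples by mixing clips of the same class with random gains and time shifts:

```python
import numpy as np

def augment_same_class(clip_a, clip_b, rng):
    """Create a new training example by mixing two clips of the same class.

    A generic audio-augmentation sketch (random gain mix plus a circular
    time shift); the paper's exact augmentation scheme may differ.
    """
    w = rng.uniform(0.3, 0.7)
    mixed = w * clip_a + (1 - w) * clip_b
    shift = rng.integers(0, len(mixed))
    return np.roll(mixed, shift)

rng = np.random.default_rng(0)
a, b = rng.standard_normal(16000), rng.standard_normal(16000)  # stand-ins for 1 s @ 16 kHz
new_example = augment_same_class(a, b, rng)
```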

Category AI & Machine Learning
Name N. Takahashi(Sony Corporation),
M. Gygli,
B. Pfister,
L. V. Gool(ETH Zurich)

[Academic Conference]IEEE Advanced Information Networking and Applications (AINA)
[Theme]Dynamic Sensitivity Control based on Two-Hop Farthest Terminal in Dense WLAN

The explosive usage of IEEE 802.11 Wireless Local Area Network (WLAN) has resulted in dense deployments and excessive interference between Basic Service Sets (BSSs) in urban areas such as apartment buildings and airports. Serious hidden/exposed-terminal problems in high-density conditions negatively impact system throughput. To improve system efficiency, the IEEE 802.11ax TG has been assembled. The TG aims at realizing High-Efficiency WLAN (HEW) by utilizing spatial reuse technologies including Dynamic Sensitivity Control (DSC), Transmit Power Control (TPC), and BSS Color Filtering (BCF). In this paper, we propose a DSC scheme based on the two-hop farthest terminal for dense WLAN. This scheme, with minimum transmission power, resolves the hidden terminal problem. The propagation loss of the signal received from the associated communication pair is used to set proper values of transmission power and carrier sense level. Furthermore, adjusting these parameters destination by destination can reduce exposed terminals effectively. We evaluate the performance of the proposed scheme in a residential-building scenario with three criteria: aggregate throughput, fairness, and frame error rate. Simulation results show that the proposed scheme can improve aggregate downlink throughput and fairness compared to a previously proposed method in which the carrier sense level is set based on the expected RSSI of packets received from the communicating pair. Furthermore, the improvement in frame loss rate implies that the hidden terminal problem can be solved by the proposed scheme.
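
The general DSC idea can be sketched as follows (the margin rule and parameter names here are illustrative, not the paper's algorithm): transmit with just enough power to reach the farthest terminal that must be protected, and raise the carrier-sense threshold accordingly so that weak, distant transmissions no longer block the channel:

```python
def dsc_parameters(tx_power_max_dbm, rx_sensitivity_dbm, path_loss_farthest_db,
                   margin_db=3.0):
    """Illustrative DSC rule (not the paper's exact algorithm).

    Transmit with just enough power to reach the farthest terminal to protect,
    and raise the carrier-sense threshold accordingly to allow more spatial reuse.
    """
    tx_power = min(tx_power_max_dbm,
                   rx_sensitivity_dbm + path_loss_farthest_db + margin_db)
    # Signals weaker than this are ignored, enabling concurrent transmissions.
    cs_threshold = tx_power - path_loss_farthest_db - margin_db
    return tx_power, cs_threshold

print(dsc_parameters(20.0, -82.0, 75.0))  # e.g. (-4.0 dBm, -82.0 dBm)
```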

Category Communication
Name T. Ohnuma,
H. Shigeno (Keio University),
T. Yamaura,
Y. Tanaka(Sony Corporation)

[Academic Conference]IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP)
[Theme]Multichannel blind source separation based on non-negative tensor factorization in wavenumber domain

Multichannel non-negative matrix factorization based on a spatial covariance model is one of the most promising techniques for blind source separation. However, this approach is not tractable for a large number of microphones, M, because the computational cost is of order O(M³) per time-frequency bin. To circumvent this drawback, we propose non-negative tensor factorization in the wavenumber domain, which reduces the cost to order O(M). It transforms microphone signals into the spatial frequency domain, a technique that is commonly used for soundfield reconstruction. The proposed method is compared to several blind source separation (BSS) methods in terms of separation quality and computational cost.
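
The wavenumber-domain idea, in sketch form (assuming a uniform linear array; the authors' transform may differ in detail): a spatial DFT across the microphone axis decouples the channels, so later per-bin processing scales as O(M) rather than requiring M × M covariance inversions:

```python
import numpy as np

# X: STFT tensor of shape (M mics, F freq bins, T frames)
M, F, T = 8, 257, 100
rng = np.random.default_rng(0)
X = rng.standard_normal((M, F, T)) + 1j * rng.standard_normal((M, F, T))

# Spatial DFT across the microphone axis: each wavenumber bin can now be
# processed independently (O(M) per time-frequency bin) instead of inverting
# an M x M spatial covariance matrix (O(M^3) per bin).
X_wavenumber = np.fft.fft(X, axis=0)
```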

Category Audio & Acoustics
Name Y. Mitsufuji(Sony Corporation),
S. Koyama,
H. Saruwatari(The University of Tokyo)

[Academic Conference]International Conference on Pattern Recognition(ICPR)
[Theme]Latent Model Ensemble with Auto-Localization

Deep convolutional neural networks (CNNs) have exhibited superior performance in many visual recognition tasks, including image classification, object detection, and scene labeling, due to their large learning capacity and resistance to overfitting. For the image classification task, most current deep CNN-based approaches take the whole size-normalized image as input and have achieved quite promising results. Compared with the previously dominant approaches based on feature extraction, pooling, and classification, deep CNN-based approaches rely mainly on the learning capability of the deep CNN to achieve superior results: the burden of minimizing intra-class variation while maximizing inter-class difference is entirely dependent on the implicit feature learning component of the deep CNN; we rely upon the implicitly learned filters and pooling component to select the discriminative regions, which correspond to the activated neurons. However, if irrelevant regions constitute a large portion of the image of interest, the classification performance of a deep CNN that takes the whole image as input can be heavily affected. To solve this issue, we propose a novel latent CNN framework, which treats the most discriminative region as a latent variable. We can jointly learn the global CNN with the latent CNN to avoid the aforementioned irrelevant-region issue, and our experimental results show the evident advantage of the proposed latent CNN over the traditional deep CNN: the latent CNN outperforms the state-of-the-art performance of deep CNNs on standard benchmark datasets including CIFAR-10, CIFAR-100, MNIST, and the PASCAL VOC 2007 classification dataset.
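
A minimal sketch of the latent-variable idea (the crop generator and backbone below are placeholders, and the paper's joint training of global and latent CNNs is more involved): score several candidate regions and let the best-scoring one, the latent variable, represent the image:

```python
import torch
import torch.nn as nn

class LatentRegionClassifier(nn.Module):
    """Toy latent-CNN sketch: score several candidate crops, take the max.

    The crop list and backbone are placeholders, not the paper's model.
    """
    def __init__(self, n_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, n_classes),
        )

    def forward(self, images, crops):
        # crops: list of (top, left, size) candidate regions (the latent variable)
        logits = torch.stack(
            [self.backbone(images[:, :, t:t + s, l:l + s]) for t, l, s in crops],
            dim=0,
        )
        return logits.max(dim=0).values  # max over the latent region

model = LatentRegionClassifier()
x = torch.randn(2, 3, 32, 32)
out = model(x, crops=[(0, 0, 24), (8, 8, 24), (4, 4, 28)])
print(out.shape)  # torch.Size([2, 10])
```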

Category AI & Machine Learning
Name M. Sun,
T. X. Han(University of Missouri),
X. Xu,
M-C Liu,
A. K-Rostamabad(Sony Electronics Inc)

[Academic Conference]IEEE Dynamic Spectrum Access Networks(DySPAN)
[Theme]Aggregate Interference Prediction Based on Back-Propagation Neural Network

In dynamic spectrum access (DSA) scenarios, dense and complex deployment (e.g., in nonuniform or unknown radio propagation environments) of secondary systems (SSs) makes aggregate interference estimation highly complicated or challenging for reliable primary system (PS) protection. To tackle this problem, a back-propagation (BP) neural network based aggregate interference prediction method is proposed and evaluated via simulations. This paper also gives design guidelines for BP neural networks appropriate for aggregate interference prediction by revealing the impact of several key factors on prediction accuracy, such as the number of input parameters to the neural network, the coordinate system in use, and the number of hidden neurons.
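
A toy version of the setup (synthetic data and a simple power-law path loss, purely illustrative): a back-propagation-trained MLP learns to map transmitter coordinates to interference at a protected receiver:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for the DSA scenario (illustrative data, not the paper's):
# inputs are transmitter coordinates, the target is the interference seen by a
# protected receiver at the origin under a simple power-law path loss.
rng = np.random.default_rng(0)
tx_xy = rng.uniform(-500, 500, size=(2000, 2))            # secondary transmitters
dist = np.maximum(np.linalg.norm(tx_xy, axis=1), 1.0)     # avoid log of ~0
interference_dbm = 20.0 - 35.0 * np.log10(dist)           # toy propagation model

model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
model.fit(tx_xy, interference_dbm)
print(model.predict([[100.0, -50.0]]))
```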

Category AI & Machine Learning
Name Y. Zhao,
L. Shi (Beijing Jiaotong University),
X. Guo,
C. Sun (SCRL)

[Academic Conference]Empirical Methods on Natural Language Processing(EMNLP)
[Theme]Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

Modeling textual or visual information with vector representations trained from large language or visual datasets has been successfully explored in recent years. However, tasks such as visual question answering require combining these vector representations with each other. Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations. We hypothesize that these methods are not as expressive as an outer product of the visual and textual vectors. As the outer product is typically infeasible due to its high dimensionality, we instead propose utilizing Multimodal Compact Bilinear pooling (MCB) to efficiently and expressively combine multimodal features. We extensively evaluate MCB on the visual question answering and grounding tasks. We consistently show the benefit of MCB over ablations without MCB. For visual question answering, we present an architecture which uses MCB twice, once for predicting attention over spatial features and again to combine the attended representation with the question representation. This model outperforms the state-of-the-art on the Visual7W dataset and the VQA challenge.
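
The MCB pooling step itself is well documented: the count sketch of an outer product equals the circular convolution of the two count sketches, so the d²-dimensional bilinear feature is approximated in d dimensions via FFTs. A numpy sketch of the pooling step (dimensions are illustrative):

```python
import numpy as np

def count_sketch(v, h, s, d):
    """Count-sketch projection of v into d dimensions (hash h, signs s)."""
    out = np.zeros(d)
    np.add.at(out, h, s * v)  # scatter-add with random signs
    return out

def mcb(v1, v2, d=4096, seed=0):
    """Multimodal Compact Bilinear pooling: the count sketch of the outer
    product of v1 and v2 equals the circular convolution of their count
    sketches, computed here in the FFT domain."""
    rng = np.random.default_rng(seed)
    h1 = rng.integers(0, d, size=v1.size)
    s1 = rng.choice([-1.0, 1.0], size=v1.size)
    h2 = rng.integers(0, d, size=v2.size)
    s2 = rng.choice([-1.0, 1.0], size=v2.size)
    fft1 = np.fft.rfft(count_sketch(v1, h1, s1, d))
    fft2 = np.fft.rfft(count_sketch(v2, h2, s2, d))
    return np.fft.irfft(fft1 * fft2, n=d)

visual = np.random.default_rng(1).standard_normal(2048)   # e.g. image feature
textual = np.random.default_rng(2).standard_normal(300)   # e.g. question embedding
joint = mcb(visual, textual)   # 4096-dim joint representation
```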

Category AI & Machine Learning
Name A. Fukui(Sony Corporation/University of California, Berkeley),
D. H. Park,
D. Yang(University of California, Berkeley),
A. Rohrbach(University of California, Berkeley/Max Planck Institute for Informatics),
T. Darrell,
M. Rohrbach(University of California, Berkeley)

[Academic Conference]Association for Computational Linguistics(ACL)
[Theme]Domain Adaptation for Neural Networks by Parameter Augmentation

We propose a simple domain adaptation method for neural networks in a supervised setting. Supervised domain adaptation is a way of improving the generalization performance on the target domain by using the source-domain dataset, assuming that both datasets are labeled. Recently, recurrent neural networks have been shown to be successful on a variety of NLP tasks such as caption generation; however, existing domain adaptation techniques are limited to (1) tuning the model parameters on the target dataset after training on the source dataset, or (2) designing the network to have dual outputs, one for the source domain and the other for the target domain. Reformulating the idea of the domain adaptation technique proposed by Daume (2007), we propose a simple domain adaptation method which can be applied to neural networks trained with a cross-entropy loss. On captioning datasets, we show performance improvements over other domain adaptation methods.
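
For reference, Daume's original "frustratingly easy" feature augmentation, which the paper reformulates as parameter augmentation for neural networks, is a three-block copy of the input:

```python
import numpy as np

def augment_features(x, domain):
    """Daume (2007) feature augmentation.

    Each input is mapped to [shared, source-only, target-only] copies, so a
    single linear model can learn shared and domain-specific weights. The
    paper reformulates this idea as parameter augmentation for neural nets.
    """
    zeros = np.zeros_like(x)
    if domain == "source":
        return np.concatenate([x, x, zeros])
    return np.concatenate([x, zeros, x])   # target domain

x = np.array([0.5, -1.2, 3.0])
print(augment_features(x, "source"))  # [x, x, 0]
print(augment_features(x, "target"))  # [x, 0, x]
```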

Category AI & Machine Learning
Name Y. Watanabe(Sony Corporation),
K. Hashimoto,
Y. Tsuruoka(The University of Tokyo)

[Academic Conference]International Conference of Intelligent Robotic and Control Engineering (IRCE)
[Theme]One-shot Learning Gesture Recognition Based on Evolution of Discrimination with Successive Memory

In this paper, a one-shot learning gesture recognition algorithm based on evolution of discrimination with successive memory is presented. It utilizes the transferability of a large-scale pre-trained DNN (Deep Neural Network) gesture recognition model together with distance discrimination to carry out high-performance recognition with evolutionary discrimination. Our scheme is as follows. First, a DNN gesture recognition model is trained on a sample set covering 19 classes of the BSG dataset, serving as a transferable model with a powerful feature extractor. Second, the transferable extractor is used to extract features of labeled root samples and test samples for one-shot learning gesture recognition, achieving high-performance feature extraction and structured arraying. Finally, discriminative recognition is carried out using the Euclidean distance between root features and test features. Meanwhile, a mechanism for updating and evolving the root-feature memory is built and used to enhance recognition performance. Software for online one-shot learning gesture recognition in practical applications was designed and developed, achieving outstanding performance with fast response and high recognition accuracy. A series of experiments on an additional 10 classes of the BSG dataset verifies and validates the performance advantages of the proposed algorithm.
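
A minimal sketch of distance-based recognition with an evolving root-feature memory (the `embed` function stands in for the pre-trained DNN extractor, and the running-mean update is an assumption, not the paper's exact rule):

```python
import numpy as np

class OneShotGestureClassifier:
    """Sketch of distance-based one-shot recognition with an evolving memory.

    `embed` stands in for the pre-trained DNN feature extractor (an assumption;
    the update rule below is a simple running mean, not the paper's exact rule).
    """
    def __init__(self, embed):
        self.embed = embed
        self.roots = {}    # class label -> (feature, count)

    def register(self, label, sample):
        self.roots[label] = (self.embed(sample), 1)

    def classify(self, sample, update=True):
        f = self.embed(sample)
        label = min(self.roots,
                    key=lambda c: np.linalg.norm(f - self.roots[c][0]))
        if update:  # evolve the memorized root feature toward new evidence
            root, n = self.roots[label]
            self.roots[label] = ((root * n + f) / (n + 1), n + 1)
        return label

clf = OneShotGestureClassifier(embed=lambda x: np.asarray(x, dtype=float))
clf.register("swipe", [1.0, 0.0])
clf.register("wave", [0.0, 1.0])
print(clf.classify([0.9, 0.2]))  # 'swipe'
```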

Category AI & Machine Learning
Name X. Li,
S. Qin(BUAA School of Automation Science and Electrical Engineering),
K. Xu,
Z. Hu(SCRL)

[Academic Conference]IEEE Robotics and Automation Letters (RA-L) with ICRA2018 option
[Theme]Comparison between Force-controlled Skin Deformation Feedback and Hand-Grounded Kinesthetic Force Feedback for Sensory Substitution

Teleoperation and virtual reality systems benefit from force sensory substitution when kinesthetic force feedback devices are infeasible due to stability or workspace limitations. We compared the performance of sensory substitution when it is provided through a cutaneous method (skin deformation feedback) and a kinesthetic method (hand-grounded force feedback). For skin deformation feedback, we used a new force-controlled tactile sensory substitution device with the ability to provide tangential and normal force directly to the finger pad. Three-axis force control with 15 Hz bandwidth was achieved using a delta mechanism and three-axis force sensor. For hand-grounded force feedback, forces were grounded against the palm. As a control, world-grounded force feedback was provided using a three-degree-of-freedom kinesthetic force feedback device. Study participants were able to match a reference world-grounded force better with hand-grounded kinesthetic force feedback than with skin deformation feedback. Participants were also able to apply more accurate and precise forces with hand-grounded kinesthetic force feedback than with skin deformation feedback. Conversely, skin deformation feedback resulted in the lowest error during initial force adjustment. These experiments demonstrate relative advantages and disadvantages of skin deformation and hand-grounded kinesthetic force feedback for force sensory substitution.

Category Robotics
Name Y. Kamikawa(Sony Corporation)

[Academic Conference]IEEE International Conference on Robotics and Automation(ICRA)
[Theme]Latency and Refresh Rate on Force Perception via Sensory Substitution by Force-Controlled Skin Deformation Feedback 

Latency and refresh rate are known to adversely affect human force perception in bilateral teleoperators and virtual environments using kinesthetic force feedback, motivating the use of sensory substitution of force. The purpose of this study is to quantify the effects of latency and refresh rate on force perception using sensory substitution by skin deformation feedback. A force-controlled skin deformation feedback device was attached to a 3-degree-of-freedom kinesthetic force feedback device used for position tracking and gravity support. A human participant study was conducted to determine the effects of latency and refresh rate on perceived stiffness and damping with skin deformation feedback. Participants compared two virtual objects: a comparison object with stiffness or damping that could be tuned by the participant, and a reference object with either added latency or reduced refresh rate. Participants modified the stiffness or damping of the tunable object until it resembled the stiffness or damping of the reference object. We found that added latency and reduced refresh rate both increased perceived stiffness but had no effect on perceived damping. Specifically, participants felt significantly different stiffness when the latency exceeded 300 ms and the refresh rate dropped below 16.6 Hz. The impact of latency and refresh rate on force perception via skin deformation feedback was significantly less than what has been previously shown for kinesthetic force feedback.

Category Robotics
Name Z. A. Zook,
A. M. Okamura(Stanford University),
Y. Kamikawa(Sony Corporation)

[Academic Conference]IEEE International Conference on Robotics and Automation(ICRA)
[Theme]Magnified Force Sensory Substitution for Telemanipulation via Force-Controlled Skin Deformation

Teleoperation systems could benefit from force sensory substitution when kinesthetic force feedback systems are too bulky or expensive, and when they cause instability by magnifying force feedback. We aim to magnify force feedback using sensory substitution via force-controlled tactile skin deformation, using a device with the ability to provide tangential and normal force directly to the fingerpads. The sensory substitution device is able to provide skin deformation force feedback over ten times the maximum stable kinesthetic force feedback on a da Vinci Research Kit teleoperation system. We evaluated the effect of this force magnification in two experimental tasks where the goal was to minimize interaction force with the environment. In a peg transfer task, magnified force feedback using sensory substitution improved participants’ performance for force magnifications up to ten times, but decreased performance for higher force magnifications. In a tube connection task, sensory substitution that doubled the force feedback maximized performance; there was no improvement at the larger magnifications. These experiments demonstrate that magnified force feedback using sensory substitution via force-controlled skin deformation feedback can decrease applied forces similarly to magnified kinesthetic force feedback during teleoperation.

Category Robotics
Name Y. Kamikawa(Sony Corporation)

[Academic Conference]International Speech Communication Association(Interspeech)
[Theme]Attention-based Convolutional Neural Networks for Sentence Classification

Sentence classification is one of the foundational tasks in spoken language understanding (SLU) and natural language processing (NLP). In this paper we propose a novel convolutional neural network (CNN) with an attention mechanism to improve the performance of sentence classification. In a traditional CNN, it is not easy to encode long-term contextual information and correlations between non-consecutive words effectively. In contrast, our attention-based CNN is able to capture these kinds of information for each word without any external features. We conducted experiments on various public and in-house datasets. The experimental results demonstrate that our proposed model significantly outperforms the traditional CNN model and achieves competitive performance with models that exploit rich syntactic features.
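
A minimal sketch of attaching attention to a CNN for sentence classification: each word attends over the whole sentence to build a context vector that is concatenated to its embedding before convolution. The dot-product scoring and concatenation are assumptions for illustration, not the paper's exact mechanism.

import numpy as np

def attention_context(E):
    # E: (seq_len, dim) word embeddings.
    scores = E @ E.T                                   # pairwise relevance
    np.fill_diagonal(scores, -np.inf)                  # ignore self-attention
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                  # softmax per word
    context = w @ E                                    # long-range context vectors
    return np.concatenate([E, context], axis=1)        # CNN input: [word; context]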

Category AI & Machine Learning
Name Z. Zhao,
Y. Wu(Sony(China)Limited)

[Academic Conference]IEEE Computer Vision and Pattern Recognition(CVPR)
[Theme]Affinity CNN: Learning Pixel-Centric Pairwise Relations for Figure/Ground Embedding

Spectral embedding provides a framework for solving perceptual organization problems, including image segmentation and figure/ground organization. From an affinity matrix describing pairwise relationships between pixels, it clusters pixels into regions, and, using a complex-valued extension, orders pixels according to layer. We train a convolutional neural network (CNN) to directly predict the pairwise relationships that define this affinity matrix. Spectral embedding then resolves these predictions into a globally-consistent segmentation and figure/ground organization of the scene. Experiments demonstrate significant benefit to this direct coupling compared to prior works which use explicit intermediate stages, such as edge detection, on the pathway from image to affinities. Our results suggest spectral embedding as a powerful alternative to the conditional random field (CRF)-based globalization schemes typically coupled to deep neural networks.
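
For readers unfamiliar with the spectral step, the sketch below shows generic spectral embedding from an affinity matrix via the normalized graph Laplacian; the paper's complex-valued extension for figure/ground ordering is not reproduced here.

import numpy as np

def spectral_embedding(W, k=2):
    # W: (n, n) symmetric, nonnegative pixel affinity matrix.
    d = W.sum(axis=1)
    d_isqrt = 1.0 / np.sqrt(d + 1e-12)
    L = np.eye(W.shape[0]) - (d_isqrt[:, None] * W) * d_isqrt[None, :]
    vals, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    return vecs[:, 1:k + 1]             # skip the trivial constant eigenvector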

Category AI & Machine Learning
Name M. Maire(Toyota Technological Institute at Chicago),
T. Narihira(Sony/University of California, Berkeley),
S. X. Yu(University of California, Berkeley)

[Academic Conference]Association for the Advancement of Artificial Intelligence(AAAI)
[Theme]Modeling Human Understanding of Complex Intentional Action with a Bayesian Nonparametric Subgoal Model

Most human behaviors consist of multiple parts, steps, or subtasks. These structures guide our action planning and execution, but when we observe others, the latent structure of their actions is typically unobservable, and must be inferred in order to learn new skills by demonstration, or to assist others in completing their tasks. For example, an assistant who has learned the subgoal structure of a colleague’s task can more rapidly recognize and support their actions as they unfold. Here we model how humans infer subgoals from observations of complex action sequences using a nonparametric Bayesian model, which assumes that observed actions are generated by approximately rational planning over unknown subgoal sequences. We test this model with a behavioral experiment in which humans observed different series of goal-directed actions, and inferred both the number and composition of the subgoal sequences associated with each goal. The Bayesian model predicts human subgoal inferences with high accuracy, and significantly better than several alternative models and straightforward heuristics. Motivated by this result, we simulate how learning and inference of subgoals can improve performance in an artificial user assistance task. The Bayesian model learns the correct subgoals from fewer observations, and better assists users by more rapidly and accurately inferring the goal of their actions than alternative approaches.

Category AI & Machine Learning
Name R. Nakahashi(Sony Corporation/Massachusetts Institute of Technology),
C. L. Baker,
J. B. Tenenbaum(Massachusetts Institute of Technology)

[Academic Conference]IEEE Signal Processing Advances in Wireless Communications(SPAWC)
[Theme]Low Complexity Beamforming Training Method for mmWave Communications

This paper introduces a low-complexity method for antenna sector selection in mmWave Hybrid MIMO communication systems such as the IEEE 802.11ay amendment for Wireless LANs. The method is backward compatible with the methods already defined for the released mmWave standard IEEE 802.11ad. We introduce an extension of the 802.11ad channel model to support common Hybrid MIMO configurations. The proposed method is evaluated and compared to the theoretical limit of transmission rates found by exhaustive search. In contrast to state-of-the-art solutions, the presented method requires only sparse channel information. Numerical results show a significant complexity reduction in terms of the number of necessary trainings, while approaching the maximum achievable rate.
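
To make the complexity benchmark concrete, here is the exhaustive-search baseline the method is compared against: every transmit/receive sector pair is trained and the pair with the highest link gain wins, costing |TX codebook| × |RX codebook| measurements. Only this baseline is sketched; the proposed low-complexity training is not reproduced here.

import numpy as np

def exhaustive_sector_search(H, tx_codebook, rx_codebook):
    # H: (n_rx, n_tx) narrowband channel matrix; codebooks: lists of unit-norm
    # beamforming vectors. Cost: len(tx_codebook) * len(rx_codebook) trainings.
    best_gain, best_pair = -np.inf, None
    for i, f in enumerate(tx_codebook):
        for j, w in enumerate(rx_codebook):
            gain = np.abs(np.conj(w) @ H @ f) ** 2     # received beamforming gain
            if gain > best_gain:
                best_gain, best_pair = gain, (i, j)
    return best_pair, best_gain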

Category Communication
Name F. Fellhauer(Sony Europe Limited/University of Stuttgart),
N. Loghin,
D. Ciochina,
T. Handte(Sony Europe Limited),
S. ten Brink(University of Stuttgart)

[Academic Conference]IEEE Broadband Multimedia Systems and Broadcasting(BMSB)
[Theme]Terrestrial broadcast system using preamble and frequency division multiplexing

Broadcast systems based on FDM (Frequency Division Multiplexing) have the advantage of near-continuous demodulation of the broadcast signal, allowing accurate and continuous tracking of channel conditions, which is particularly useful for mobile reception. This has been employed in the ISDB-T standard used in Japan, Brazil, and other countries. However, as designed in ISDB-T, the broadcast signal lacks the ability to send system parameters such as the FFT size, GI size, and so on before the receiver begins demodulation. The receiver must blindly estimate such system parameters before it can read the other detailed parameter information using the TMCC pilot carriers. This takes time, usually one frame or longer. This paper proposes a next-generation FDM system that retains the original advantages of FDM while adding further advantages by employing an additional small signal (Preamble 1), which imparts essential information such as the FFT size, GI size, and pilot pattern to the receiver, enabling immediate demodulation of the broadcast signal based on known parameters rather than blind estimation. Following demodulation of the first preamble, demodulation of the second preamble (Preamble 2) allows immediate knowledge of all subsequent parameters, contributing to faster demodulation of the overall signal.
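
The receiver-side benefit is that Preamble 1 can be decoded into known parameters instead of blind estimation. The toy decoder below illustrates this; the field order, widths, and value tables are invented for the example, since the abstract only states which parameters are conveyed.

def parse_preamble1(bits):
    # Illustrative decode of Preamble 1 signaling bits (a string of '0'/'1').
    # All field layouts below are assumptions, not the proposed standard.
    fft_size = {0: 8192, 1: 16384, 2: 32768}[int(bits[0:2], 2)]
    gi_fraction = {0: 1/32, 1: 1/16, 2: 1/8, 3: 1/4}[int(bits[2:4], 2)]
    pilot_pattern = int(bits[4:7], 2)
    return {"fft_size": fft_size, "gi_fraction": gi_fraction,
            "pilot_pattern": pilot_pattern}

# e.g. parse_preamble1("0110100") -> {'fft_size': 16384, 'gi_fraction': 0.125, 'pilot_pattern': 4}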

Category B2B & Professional
Name L. Michael,
K. Takahashi,
Y. Shinohara,
L. Sakai,
M. Kan(Sony Corporation),
S. Atungsiri(Sony Europe Limited)

[Academic Conference]IEEE Transactions on Multimedia
[Theme]AENet: Learning Deep Audio Features for Video Analysis

We propose a new deep network for audio event recognition, called AENet. In contrast to speech, sounds coming from audio events may be produced by a wide variety of sources. Furthermore, distinguishing them often requires analyzing an extended time period due to the lack of clear subword units that are present in speech. In order to incorporate this long-time frequency structure of audio events, we introduce a convolutional neural network (CNN) operating on a large temporal input. In contrast to previous works, this allows us to train an audio event detection system end to end. The combination of our network architecture and a novel data augmentation outperforms previous methods for audio event detection by 16%. Furthermore, we perform transfer learning and show that our model learned generic audio features, similar to the way CNNs learn generic features on vision tasks. In video analysis, combining visual features and traditional audio features, such as mel frequency cepstral coefficients, typically only leads to marginal improvements. Instead, combining visual features with our AENet features, which can be computed efficiently on a GPU, leads to significant performance improvements on action recognition and video highlight detection. In video highlight detection, our audio features improve the performance by more than 8% over visual features alone.
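
The combination with visual features amounts to straightforward late fusion; a common recipe (not necessarily the paper's exact scheme) is to L2-normalize each modality and concatenate before the classifier:

import numpy as np

def fuse(visual_feat, aenet_feat):
    # L2-normalize each modality so neither dominates, then concatenate.
    v = visual_feat / (np.linalg.norm(visual_feat) + 1e-12)
    a = aenet_feat / (np.linalg.norm(aenet_feat) + 1e-12)
    return np.concatenate([v, a])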

Category AI & Machine Learning
Name N. Takahashi(Sony Corporation),
M. Gygli,
L. Van Gool(ETH Zurich)

[Academic Conference]IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP)
[Theme]Improving music source separation based on deep neural networks through data augmentation and network blending

This paper deals with the separation of music into individual instrument tracks, which is known to be a challenging problem. We describe two different deep neural network architectures for this task, a feed-forward and a recurrent one, and show that each of them yields state-of-the-art results on the SiSEC DSD100 dataset. For the recurrent network, we use data augmentation during training and show that even simple separation networks are prone to overfitting if no data augmentation is used. Furthermore, we propose a blending of both neural network systems in which we linearly combine their raw outputs and then perform multi-channel Wiener filter post-processing. This blending scheme yields the best results that have been reported to date on the SiSEC DSD100 dataset.
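
A compact sketch of the blending step: the two networks' magnitude estimates are linearly combined, and the blended estimates then drive a Wiener-style post-filter. The single-channel mask below stands in for the paper's multi-channel Wiener filter, and alpha = 0.5 is an illustrative weight.

import numpy as np

def blend_and_mask(mag_ff, mag_rnn, other_source_mags, alpha=0.5, eps=1e-12):
    blended = alpha * mag_ff + (1.0 - alpha) * mag_rnn        # linear output blend
    total = blended ** 2 + sum(m ** 2 for m in other_source_mags) + eps
    return blended, blended ** 2 / total                      # estimate and Wiener-style mask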

Category AI & Machine Learning
Name S. Uhlich,
M. Porcu,
F. Giron,
M. Enenkl,
T. Kemp(Sony Europe Limited),
N. Takahashi,
Y. Mitsufuji(Sony Corporation)

[Academic Conference]IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP)
[Theme]Supervised monaural source separation based on autoencoders

In this paper, we propose a new supervised monaural source separation method based on autoencoders. We employ autoencoders for dictionary training so that the nonlinear network can encode the target source with high expressiveness. The dictionary is trained on each target source without the mixture signal, which makes the system independent of the context in which the dictionaries will be used. In the separation process, the decoder portions of the trained autoencoders are used as dictionaries to find the activations in an iterative manner such that the summation of the decoder outputs approximates the original mixture. The results of instrument source separation experiments revealed that the separation performance of the proposed method was superior to that of NMF.
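
The separation stage can be pictured with linear decoders standing in for the trained decoder networks: nonnegative activations are updated iteratively so that the sum of decoder outputs approaches the mixture. The projected-gradient update below is an assumption; the abstract does not specify the update rule, and the real decoders are nonlinear networks.

import numpy as np

def separate(mixture, decoders, n_iter=200, lr=0.01):
    # mixture: (n_bins,) magnitude spectrum; decoders: list of (n_bins, n_latent)
    # matrices playing the role of each source's trained decoder.
    h = [np.zeros(D.shape[1]) for D in decoders]
    for _ in range(n_iter):
        residual = mixture - sum(D @ hk for D, hk in zip(decoders, h))
        for k, D in enumerate(decoders):
            h[k] = np.maximum(0.0, h[k] + lr * (D.T @ residual))   # keep activations >= 0
    return [D @ hk for D, hk in zip(decoders, h)]                  # per-source estimates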

Category AI & Machine Learning
Name K. Osako,
Y. Mitsufuji(Sony Corporation),
R. Singh,
B. Raj(Sony(China)Limited)

[Academic Conference]IEEE Workshop on Applications of Signal Processing to Audio and Acoustics(WASPAA)
[Theme]Multi-Scale Multi-Band DenseNets for Audio Source Separation

This paper deals with the problem of audio source separation. To handle the complex and ill-posed nature of audio source separation, current state-of-the-art approaches employ deep neural networks to obtain instrumental spectra from a mixture. In this study, we propose a novel network architecture that extends the recently developed densely connected convolutional network (DenseNet), which has shown excellent results on image classification tasks. To deal with the specific problem of audio source separation, an up-sampling layer, block skip connections, and band-dedicated dense blocks are incorporated on top of DenseNet. The proposed approach takes advantage of long contextual information and outperforms state-of-the-art results on the SiSEC 2016 competition by a large margin in terms of signal-to-distortion ratio. Moreover, the proposed architecture requires significantly fewer parameters and considerably less training time than other methods.
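
The band-dedicated idea can be summarized in a few lines: the spectrogram is split along frequency, each band gets its own dense network, and the band outputs are stitched back together. The split points and the combination below are placeholders, not the published architecture.

import numpy as np

def multi_band_forward(spec, band_nets, splits):
    # spec: (n_bins, n_frames) spectrogram; band_nets: one callable per band;
    # splits: bin indices delimiting the bands, e.g. [0, 256, 1024, n_bins].
    outputs = []
    for net, lo, hi in zip(band_nets, splits[:-1], splits[1:]):
        outputs.append(net(spec[lo:hi]))          # dedicated dense block per band
    return np.concatenate(outputs, axis=0)        # reassemble the full spectrum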

Category Audio & Acoustics
Name N. Takahashi,
Y. Mitsufuji(Sony Corporation)

[Academic Conference]International Speech Communication Association(Interspeech)
[Theme]Hierarchical Recurrent Neural Network for Story Segmentation

A broadcast news stream consists of a number of stories and each story consists of several sentences. We capture this structure using a hierarchical model based on a word-level Recurrent Neural Network (RNN) sentence modeling layer and a sentence-level bidirectional Long Short-Term Memory (LSTM) topic modeling layer. First, the word-level RNN layer extracts a vector embedding the sentence information from the given transcribed lexical tokens of each sentence. These sentence embedding vectors are fed into a bidirectional LSTM that models the sentence and topic transitions. A topic posterior for each sentence is estimated discriminatively and a Hidden Markov model (HMM) follows to decode the story sequence and identify story boundaries. Experiments on the topic detection and tracking (TDT2) task indicate that the hierarchical RNN topic modeling achieves the best story segmentation performance with a higher F1-measure compared to conventional state-of-the-art methods. We also compare variations of our model to infer the optimal structure for the story segmentation task.
Index Terms: spoken language processing, recurrent neural network, topic modeling, story segmentation
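
The final HMM stage can be made concrete with a standard Viterbi decode over the sentence-level topic posteriors; story boundaries fall where the decoded topic changes. The switch probability is a placeholder, and the real system's HMM parameterization may differ.

import numpy as np

def story_boundaries(topic_post, p_switch=0.1):
    # topic_post: (n_sentences, n_topics) posteriors from the bi-LSTM layer.
    T, K = topic_post.shape
    logA = np.full((K, K), np.log(p_switch / (K - 1)))   # topic-switch penalty
    np.fill_diagonal(logA, np.log(1.0 - p_switch))
    delta = np.log(topic_post[0] + 1e-12)
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA                   # scores[i, j]: topic i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(topic_post[t] + 1e-12)
    state = int(delta.argmax())
    path = [state]
    for t in range(T - 1, 0, -1):                        # backtrack best path
        state = int(back[t, state])
        path.append(state)
    path.reverse()
    return [t for t in range(1, T) if path[t] != path[t - 1]]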

Category AI & Machine Learning
Name E. Tsunoo(The University of Edinburgh/Sony Corporation),
O. Klejch,
P. Bell,
S. Renals(The University of Edinburgh)

[Academic Conference]IEEE Automatic Speech Recognition and Understanding(ASRU)
[Theme]Hierarchical recurrent neural network for story segmentation using fusion of lexical and acoustic features

A broadcast news stream consists of a number of stories, and finding the boundaries of stories automatically is an important task in news analysis. We capture the topic structure using a hierarchical model based on a Recurrent Neural Network (RNN) sentence modeling layer and a bidirectional Long Short-Term Memory (LSTM) topic modeling layer, with a fusion of acoustic and lexical features. Both features are accumulated with RNNs and trained jointly within the model to be fused at the sentence level. We conduct experiments on the topic detection and tracking (TDT4) task, comparing combinations of the two modalities trained with a limited amount of parallel data. Furthermore, we utilize additional text data, available in sufficient quantity, to refine our model. Experimental results indicate that the hierarchical RNN topic modeling takes advantage of the fusion scheme, especially with the additional text training data, achieving a higher F1-measure than conventional state-of-the-art methods.

Category AI & Machine Learning
Name E. Tsunoo(The University of Edinburgh/Sony Corporation),
O. Klejch,
P. Bell,
S. Renals(The University of Edinburgh)

[Academic Conference]IEEE International Conference on Communications(ICC)
[Theme]MocLis: A Moving Cell Support Protocol Based on Locator/ID Separation for 5G System

In the LTE/LTE-Advanced (LTE-A) system, the user plane for a user equipment (UE) is provided by tunneling, which causes header overhead, processing overhead, and management overhead. In addition, the LTE-A system does not support moving cells, which are composed of a mobile Relay Node (RN) and the UEs attached to it. There are several proposals for moving cells in the LTE-A system and the 5G system; however, all of them rely on tunneling for the user plane, which means that none of them avoid the tunneling overheads. This paper proposes MocLis, a moving cell support protocol based on a Locator/ID split approach. MocLis does not use tunneling. Nested moving cells are supported. The signaling cost for handover of a moving cell is independent of the number of UEs and nested RNs in the moving cell. MocLis is implemented in Linux as user-space daemons and a modified kernel. Measurement results show that the attachment time and handover time are short enough for practical use. TCP throughput in MocLis is higher than in the tunneling-based approaches.
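
The core of the Locator/ID split can be sketched as a mapping table: packets carry a persistent node ID, and a resolver maps the ID to the current locator, so a moving cell's handover is one mapping update rather than per-UE tunnel re-establishment. This toy omits all of MocLis's actual signaling and nesting details.

class LocatorTable:
    def __init__(self):
        self._map = {}                 # persistent node ID -> current locator

    def attach(self, node_id, locator):
        self._map[node_id] = locator

    def handover(self, cell_id, new_locator):
        # One update for the moving cell; cost is independent of how many UEs
        # and nested relay nodes sit behind it.
        self._map[cell_id] = new_locator

    def resolve(self, node_id):
        return self._map[node_id]      # used instead of encapsulation/tunneling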

Category Communication
Name T. Ochiai,
K. Matsueda,
F. Teraoka(Keio University, Japan),
H. Takano,
R. Kimura,
R. Sawai(Sony Corporation)

[Academic Conference]IEEE International Workshop on Signal Processing Advances in Wireless Communications(SPAWC)
[Theme]Non-Line-of-Sight Positioning for Mmwave Communications

Using information about the wireless communication channel is a well-known approach to estimating a user's position. It has been shown that such methods can provide positioning information in line-of-sight (LOS) situations by estimating channel properties such as the time of flight, direction of arrival, and direction of departure of a link between a single access point and a station. In this paper we focus on mmWave channels and propose a method that allows positioning in indoor scenarios even under non-line-of-sight conditions by exploiting the presence of scatterers. Further, we propose an approach to overcome the need for an angular reference, which is usually required to perform measurements of direction of arrival/departure and therefore limits practical applications. We investigate the influence of noisy temporal and spatial measurements on the achievable performance with and without an angular reference. Results show that with an angular reference, positioning with the proposed method is possible with an error below 4 cm in 50% of observations, increasing to 8 cm without an angular reference.
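
In the LOS baseline, a single anchor suffices: the station lies along the measured direction of arrival at the distance given by the time of flight. The 2D sketch below shows that baseline only; the paper's contribution, handling NLOS links via scatterers and removing the angular reference, is not reproduced here.

import numpy as np

def los_position(ap_pos, aoa_rad, tof_s, c=3.0e8):
    # ap_pos: (2,) access-point position; aoa_rad: direction of arrival in the
    # AP's (assumed known) angular reference; tof_s: time of flight in seconds.
    d = c * tof_s
    return np.asarray(ap_pos) + d * np.array([np.cos(aoa_rad), np.sin(aoa_rad)])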

Category Network & Data Analytics
Name F. Fellhauer(University of Stuttgart-Sony EuTEC Contractor),
N. Loghin (EuTEC),
J. Lassen,
A. Jaber (University of Stuttgart, Students)

[Academic Conference]IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP)
[Theme]Mode Domain Spatial Active Noise Control Using Sparse Signal Representation

Active noise control (ANC) over a sizeable space requires a large number of reference and error microphones to satisfy the spatial Nyquist sampling criterion, which limits the feasibility of practical realization of such systems. This paper proposes a mode-domain feedforward ANC method to attenuate the noise field over a large space while reducing the number of microphones required. We adopt a sparse reference signal representation to precisely calculate the reference mode coefficients. The proposed system consists of circular reference and error microphone arrays, which capture the reference noise signal and residual error signal, respectively, and a circular loudspeaker array to drive the anti-noise signal. Experimental results indicate that, above the spatial Nyquist frequency, our proposed method performs well compared to conventional methods. Moreover, the proposed method can even reduce the number of reference microphones while achieving better noise attenuation.
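
For a uniform circular array, the mode (circular-harmonic) coefficients at one frequency bin reduce to a spatial FFT of the microphone signals; this is the transform feeding the adaptive ANC stage. Radial/Bessel normalization is omitted for brevity.

import numpy as np

def circular_modes(p, max_order):
    # p: (Q,) complex microphone signals at one frequency bin, sampled at
    # equally spaced angles on a circle. Returns mode orders and coefficients.
    Q = len(p)
    coeffs = np.fft.fft(p) / Q
    orders = np.rint(np.fft.fftfreq(Q, d=1.0 / Q)).astype(int)   # 0, 1, ..., -1
    keep = np.abs(orders) <= max_order
    return orders[keep], coeffs[keep]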

Category Audio & Acoustics
Name Y. Maeno,
Y. Mitsufuji,
T. D. Abhayapala(ANU)

[Academic Conference]IEEE International Workshop on Acoustic Signal Enhancement (IWAENC)
[Theme]MMDenseLSTM: An Efficient Combination of Convolutional and Recurrent Neural Networks for Audio Source Separation

Deep neural networks have become an indispensable technique for audio source separation (ASS). It was recently reported that a variant of the CNN architecture called MMDenseNet was successfully employed to solve the ASS problem of estimating source amplitudes, and state-of-the-art results were obtained on the DSD100 dataset. To further enhance MMDenseNet, here we propose a novel architecture that integrates long short-term memory (LSTM) at multiple scales with skip connections to efficiently model long-term structures within an audio context. The experimental results show that the proposed method outperforms MMDenseNet, LSTM, and a blend of the two networks. The number of parameters and the processing time of the proposed model are significantly less than those for simple blending. Furthermore, the proposed method yields better results than those obtained using ideal binary masks for a singing voice separation task.

Category AI & Machine Learning
Name N. Takahashi,
N. Goswami,
Y. Mitsufuji(Sony Corporation)

[Academic Conference]The 2018 Joint Workshop on Machine Learning for Music
[Theme]Improving DNN-based Music Source Separation using Phase Features

Music source separation with deep neural networks typically relies only on amplitude features. In this paper we show that additional phase features can improve the separation performance. Using the theoretical relationship between STFT phase and amplitude, we conjecture that derivatives of the phase are a good feature representation, as opposed to the raw phase. We verify this conjecture experimentally and propose a new DNN architecture which combines amplitude and phase. This joint approach achieves a better signal-to-distortion ratio on the DSD100 dataset for all instruments compared to a network that uses only amplitude features. In particular, the bass instrument benefits from the phase information.
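
The phase-derivative features in question are easy to compute from the STFT: the time derivative (instantaneous frequency) and frequency derivative (group delay) of the phase, wrapped back to the principal value. Axis conventions below assume a (frequency, time) STFT matrix.

import numpy as np

def phase_derivative_features(stft):
    # stft: (n_bins, n_frames) complex STFT.
    phase = np.angle(stft)
    wrap = lambda x: np.angle(np.exp(1j * x))          # wrap to (-pi, pi]
    inst_freq = wrap(np.diff(phase, axis=1))           # d(phase)/dt per bin
    group_delay = wrap(np.diff(phase, axis=0))         # d(phase)/df per frame
    return inst_freq, group_delay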

Category AI & Machine Learning
Name J. Muth(EPFL),
S. Uhlich,
F. Cardinaux,
Y. Mitsufuji(Sony Corporation)

[Academic Conference]AES Conference on Spatial Reproduction -Aesthetic and Science-
[Theme]Creating a Highly-Realistic "Acoustic Vessel Odyssey" Using Sound Field Synthesis with 576 Loudspeakers

“Acoustic Vessel Odyssey” is a sound installation realizing the future of music using Sony’s spatial audio technology called Sound Field Synthesis (SFS), which enables creators to simulate the popping, moving, and partitioning of sounds in one space. At the “Lost In Music” event, where we demonstrated “Acoustic Vessel Odyssey”, the immersive experience provided by SFS technology was further enhanced by a new, specially designed loudspeaker array consisting of 576 loudspeakers. The content was choreographed by sound artist Evala and accompanied by a light installation created by digital media artists Kimchi and Chips. In this paper, we present the details of the system architecture as well as the technical requirements of “Acoustic Vessel Odyssey”.

Category Audio & Acoustics
Name Y. Mitsufuji,
A. Tomura,
K. Ohkuri(Sony Corporation)

[Academic Conference]International Speech Communication Association(Interspeech)
[Theme]PhaseNet: Discretized Phase Modeling with Deep Neural Networks for Speech Enhancement and Audio Source Separation

Previous research on audio source separation based on deep neural networks (DNNs) mainly focuses on estimating the magnitude spectrum of target sources; typically, the phase of the mixture signal is combined with the estimated magnitude spectra in an ad-hoc way. Although recovering the target phase is assumed to be important for improving separation quality, it can be difficult to handle the periodic nature of the phase with a regression approach. Unwrapping the phase is one way to eliminate the phase discontinuity; however, it increases the range of values with each unwrapping, making the phase difficult for DNNs to model. To overcome this difficulty, we propose to treat the phase estimation problem as a classification problem by discretizing phase values and assigning class indices to them. Experimental results show that our classification-based approach 1) successfully recovers the phase of the target source in the discretized domain, 2) improves the signal-to-distortion ratio (SDR) over the regression-based approach in both a speech enhancement task and a music source separation (MSS) task, and 3) outperforms state-of-the-art MSS methods.
Index Terms: phase modeling, quantized phase, deep neural networks
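
The discretization itself is a one-liner in each direction: phase values in (-pi, pi] are binned into class indices for training, and a predicted class is mapped back to its bin-center phase at synthesis. Eight classes below is an arbitrary choice for the sketch.

import numpy as np

def phase_to_class(phase, n_classes=8):
    # Bin phase into n_classes equal sectors of the unit circle.
    return (np.floor((phase + np.pi) / (2 * np.pi) * n_classes).astype(int)
            % n_classes)

def class_to_phase(idx, n_classes=8):
    # Reconstruct with each class's bin-center phase.
    return -np.pi + (idx + 0.5) * (2 * np.pi / n_classes)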

Category AI & Machine Learning
Name N. Takahashi,
P. Agrawal(IISc),
N. Goswami,
Y. Mitsufuji(Sony Corporation)

[Academic Conference]IEEE International Workshop on Acoustic Signal Enhancement(IWAENC)
[Theme]Mode-Domain Spatial Active Noise Control Using Multiple Circular Arrays

Noise control and attenuation over a sizable space requires uniformly distributed microphones and loudspeakers, which limits such a system's viability in practice. In this paper, we propose a mode-domain active noise control (ANC) system using a simple microphone and loudspeaker array structure. We introduce a few circular microphone and loudspeaker arrays to first transform the sound field into circular expansion mode coefficients, which are then combined to calculate 3D mode coefficients and processed by an adaptive algorithm to attenuate an undesired noise field in 3D space. Experimental results indicate that our proposed method achieves noise attenuation comparable to that of a conventional method that uses an impractical array structure. Furthermore, the proposed method shows better noise attenuation performance than a conventional temporal-frequency-domain ANC system.

Category Audio & Acoustics
Name Y. Maeno,
Y. Mitsufuji(Sony Corporation),
P. N. Samarasinghe,
T. D. Abhayapala(ANU)

[Academic Conference]Audio Engineering Society International convention(AES)
[Theme]Microphone Array Geometry for Two Dimensional Broadband Sound Field Recording

Sound field recording with arrays made of omnidirectional microphones suffers from ill-conditioning due to the zeros and small values of the spherical Bessel function. This article proposes a geometric design of a microphone array for broadband two-dimensional (2D) sound field recording and reproduction. The design is parametric, with a layout having a discrete rotationally symmetric geometry composed of several geometrically similar subarrays. The parameters of the proposed layout can be tuned for various acoustic situations to give optimized results. This design has the advantage that it simultaneously satisfies many important requirements of microphone arrays, such as error robustness, operating bandwidth, and microphone unit efficiency.
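
The ill-conditioning can be seen directly by evaluating the Bessel term that scales each captured mode: wherever it approaches zero at a given radius, that mode is effectively lost, which is what combining geometrically similar sub-arrays at different radii avoids. A quick check using SciPy:

import numpy as np
from scipy.special import spherical_jn

def mode_strengths(max_order, k, r):
    # |j_n(kr)| for n = 0..max_order: the gain on each mode captured by an
    # open array of omnidirectional microphones at radius r.
    return np.array([abs(spherical_jn(n, k * r)) for n in range(max_order + 1)])

# Near k*r = pi the n = 0 term j_0(kr) = sin(kr)/(kr) vanishes for one radius;
# a sub-array at a different radius keeps that mode observable.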

Category Audio & Acoustics
Name W. Liao,
Y. Mitsufuji,
K. Osako,
K. Ohkuri(Sony Corporation)

[Academic Conference]IEEE Spoken Language Technology(SLT)
[Theme]Context-Aware Dialog Re-Ranking for Task-Oriented Dialog Systems

Dialog response ranking is used to rank response candidates by considering their relation to the dialog history. Although researchers have addressed this concept for open-domain dialogs, little attention has been focused on task-oriented dialogs. Furthermore, no previous studies have analyzed whether response ranking can improve the performance of existing dialog systems in real human-computer dialogs with speech recognition errors. In this paper, we propose a context-aware dialog response re-ranking system. Our system re-ranks responses in two steps: (1) it calculates matching scores for each candidate response and the current dialog context; (2) it combines the matching scores with a probability distribution over the candidates from an existing dialog system for response re-ranking. By using neural word-embedding-based models and handcrafted or logistic-regression-based ensemble models, we have improved the performance of a recently proposed end-to-end task-oriented dialog system on real dialogs with speech recognition errors.
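
The two-step combination can be pictured as a log-linear interpolation between the context-match score and the base system's candidate probability; the interpolation weight and log-linear form are assumptions for the sketch, as the paper uses handcrafted and logistic-regression ensembles.

import numpy as np

def rerank(candidates, match_scores, system_probs, lam=0.5):
    # Step 1 scores (context match) combined with the existing dialog
    # system's distribution; the highest combined score ranks first.
    combined = ((1.0 - lam) * np.asarray(match_scores)
                + lam * np.log(np.asarray(system_probs) + 1e-12))
    return [candidates[i] for i in np.argsort(-combined)]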

Category AI & Machine Learning
Name J. Ohmura(Sony Corporation),
M. Eskenazi(Carnegie Mellon University)

[Academic Conference]The Society for Information Display(SID)
[Theme]High-Brightness Solid-State Light Source for 4K Ultra-Short-Throw Projector

We have developed technologies for a high-output light source consisting of blue laser diodes and a reflective phosphor wheel for a next-generation 4K ultra-short-throw projector, and have achieved a fluorescence output of 87 W. As far as we know, this is the highest fluorescence output for projectors. We adopted a newly developed phosphor cooling mechanism and an inorganic binder for high reliability of the phosphor wheel; as a result, no deterioration in the phosphor wheel was observed over a period of 7,500 hours. In this paper, we report on these light-source technologies for achieving high output and high reliability.

Category Device & Material
Name Y. Maeda(Sony Semiconductor Solutions Corporation)

[Academic Conference]The Society for Information Display(SID)
[Theme]High-Luminance Monochromatic See-Through Eyewear Display with Volume Hologram

Category Display & Visual
Name T. Oku(Sony Semiconductor Solutions Corporation)

[Academic Conference]The Society for Information Display(SID)
[Theme]Improvement of Light-Extraction Efficiency of a Laser-Phosphor Light Source

We investigated a laser-phosphor light source using an inorganic phosphor wheel. We experimentally confirmed that the light-extraction efficiency of the inorganic phosphor wheel is 8% higher than that of a conventional phosphor wheel. In addition, we explain the cause of this improvement in efficiency using a fluorescence emission model.

Category Device & Material
Name H. Morita(Sony Semiconductor Solutions Corporation)

[Academic Conference]The Society for Information Display(SID)
[Theme]Distinguished Paper: New Pixel-Driving Circuit Using Self-Discharging Compensation Method for High-Resolution OLED Microdisplays on a Silicon Backplane

A new 4T2C pixel circuit formed on a silicon substrate is proposed to realize a high-resolution AMOLED microdisplay with a 7.8-μm pixel pitch. To achieve high luminance uniformity, the pixel circuit internally compensates for the Vth variation of the driving MOSFET using a self-discharging method. Also presented are 0.5-in Quad-VGA and 1.25-in wide Quad-XGA microdisplays with the proposed pixel circuit.

Category Display & Visual
Name K. Kimura(Sony Semiconductor Solutions Corporation)

[Academic Conference]The Society for Information Display(SID)
[Theme]Distinguished Paper: 4032-ppi High-Resolution OLED Microdisplay

A 0.5-inch UXGA OLED microdisplay has been developed with a 6.3-μm pixel pitch. Not only 4032-ppi high resolution but also a high frame rate, low power consumption, a wide viewing angle, and high luminance have been achieved. This newly developed OLED microdisplay is suitable for near-to-eye display applications, especially electronic viewfinders.

Category Device & Material
Name T. Fujii(Sony Semiconductor Solutions Corporation)

[Academic Conference]IEEE International Electron Devices Meeting(IEDM)
[Theme]Four-Directional Pixel-Wise Polarization CMOS Image Sensor Using Air-Gap Wire Grid on 2.5-μm Back-Illuminated Pixels

Polarization information is useful in highly functional imaging. This paper presents a four-directional pixel-wise polarization CMOS image sensor using an air-gap wire grid on 2.5-μm back-illuminated pixels. The fabricated air-gap wire grid polarizer achieved a transmittance of 63.3% and an extinction ratio of 85 at 550 nm, outperforming conventional polarization sensors. The pixel-wise polarizers fabricated with the wafer process on back-illuminated image sensors exhibit good oblique-incidence characteristics, even with small 2.5-μm polarization pixels. The proposed image sensor enables various megapixel fusion-imaging applications, such as surface reflection reduction, highly accurate depth mapping, and condition-robust surveillance.
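
From a 2x2 block of 0/45/90/135-degree pixels, the linear Stokes parameters follow directly, giving the degree and angle of linear polarization used in applications like reflection reduction. These are the standard formulas; the sensor's own processing pipeline is not described in the abstract.

import numpy as np

def linear_polarization(i0, i45, i90, i135):
    s0 = 0.5 * (i0 + i45 + i90 + i135)               # total intensity
    s1 = i0 - i90                                    # horizontal vs. vertical
    s2 = i45 - i135                                  # diagonal components
    dolp = np.sqrt(s1**2 + s2**2) / (s0 + 1e-12)     # degree of linear polarization
    aolp = 0.5 * np.arctan2(s2, s1)                  # angle of linear polarization
    return dolp, aolp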

Category Imaging & Sensing
Name T. Yamazaki(Sony Semiconductor Solutions Corporation)

[Academic Conference]IEEE International Electron Devices Meeting(IEDM)
[Theme]Novel Stacked CMOS Image Sensor with Advanced Cu2Cu Hybrid Bonding

We have successfully mass-produced novel stacked back-illuminated CMOS image sensors (BI-CIS). In the new CIS, we introduced the advanced Cu2Cu hybrid bonding that we had developed. The electrical test results showed that our highly robust Cu2Cu hybrid bonding achieved remarkable connectivity and reliability. The performance of the image sensor was also investigated, and our novel stacked BI-CIS showed favorable results.

Category Imaging & Sensing
Name Y. Kagawa(Sony Semiconductor Solutions Corporation)

[Academic Conference]IEEE International Electron Devices Meeting(IEDM)
[Theme]Near-infrared Sensitivity Enhancement of a Back-illuminated Complementary Metal Oxide Semiconductor Image Sensor with a Pyramid Surface for Diffraction Structure

We demonstrated the near-infrared (NIR) sensitivity enhancement of back-illuminated complementary metal oxide semiconductor image sensors (BI-CIS) with pyramid surface for diffraction (PSD) structures on crystalline silicon and deep trench isolation (DTI). The incident light is strongly diffracted by the PSD within the substrate, resulting in a quantum efficiency of more than 30% at 850 nm. By using a special treatment process and DTI structures, crosstalk to adjacent pixels was decreased without increasing the dark current, providing resolution equal to that of a flat structure. Testing of the prototype devices revealed that we succeeded in developing a unique BI-CIS with high NIR sensitivity.

Category Imaging & Sensing
Name I. Oshiyama(Sony Semiconductor Solutions Corporation)

[Academic Conference]IEEE International Electron Devices Meeting(IEDM)
[Theme]An Experimental CMOS Photon Detector with 0.5e- RMS Temporal Noise and 15μm pitch Active Sensor Pixels

This is the first reported non-electron-multiplying CMOS image sensor (CIS) photon detector for replacing photomultiplier tubes (PMTs). Active sensor pixels with a 15 μm pitch, complete charge transfer, and a readout noise of 0.5 e- rms are arrayed, and their digital outputs are summed to detect minute light pulses. Successful proof of radiation counting is demonstrated.

Category Imaging & Sensing
Name T. Nishihara(Sony Semiconductor Solutions Corporation)
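
A hedged back-of-the-envelope model of why 0.5 e- rms read noise permits this PMT-like operation (our simplification; the array size and pulse levels below are invented): quantize each pixel's digitized output to the nearest electron, then sum across the array so the total tracks the energy of a weak pulse.

import numpy as np

rng = np.random.default_rng(0)
N_PIXELS = 10_000        # assumed array size for the toy model
READ_NOISE_E = 0.5       # e- rms per pixel, the figure from the abstract

def summed_count(mean_photons_per_pixel):
    photons = rng.poisson(mean_photons_per_pixel, N_PIXELS)     # true signal
    raw = photons + rng.normal(0.0, READ_NOISE_E, N_PIXELS)     # + read noise
    return int(np.rint(raw).sum())    # quantize per pixel, then sum digitally

for mean in (0.0, 0.05, 0.5):        # dark, faint pulse, brighter pulse
    print(f"{mean:4.2f} e-/pixel -> summed count {summed_count(mean)}")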

[Academic Conference]IEEE International Electron Devices Meeting(IEDM)
[Theme]Pixel/DRAM/logic 3-layer stacked CMOS image sensor technology

We developed a CMOS image sensor (CIS) chip with stacked pixel/DRAM/logic layers. In this CIS chip, three Si substrates are bonded together, and the substrates are electrically connected by two-stacked through-silicon vias (TSVs) running through the CIS or the dynamic random access memory (DRAM). These TSVs exhibit low resistance, low leakage current, and high reliability. Metal connected by TSVs through the DRAM can serve as low-resistance wiring for the power supply. The Si substrate of the DRAM can be thinned to 3 μm, and its memory retention and operation characteristics remain within specifications after thinning. With this stacked CIS chip, it is possible to achieve less rolling-shutter distortion and to produce super-slow-motion video.

Category Imaging & Sensing
Name H. Tsugawa(Sony Semiconductor Solutions Corporation)
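
One way to picture what the stacked DRAM buys (our own sketch, with made-up capacity and rates, not the chip's actual memory controller): high-frame-rate frames are written into a ring buffer faster than the output interface can stream them, then drained slowly for super-slow-motion playback.

from collections import deque

class FrameRingBuffer:
    """Toy stand-in for the stacked DRAM frame memory."""
    def __init__(self, capacity_frames):
        self.buf = deque(maxlen=capacity_frames)   # oldest frames drop off

    def write(self, frame):
        self.buf.append(frame)        # called at the sensor's burst rate

    def drain(self):
        while self.buf:               # read out slowly over the normal I/F
            yield self.buf.popleft()

ring = FrameRingBuffer(capacity_frames=64)        # capacity is invented
for t in range(1000):                             # burst capture
    ring.write(f"frame@{t}")
print(next(ring.drain()))   # 'frame@936': only the newest 64 frames remain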

[Academic Conference]VLSI Symposia on Technology and Circuits(VLSI)
[Theme]An 8.3M‐pixel 480fps Global‐Shutter CMOS Image Sensor with Gain‐Adaptive Column ADCs and 2‐on‐1 Stacked Device Structure

A 4K2K 480 fps global-shutter CMOS image sensor with a Super 35 mm format has been developed. The sensor employs newly developed gain-adaptive column ADCs to attain a dark random noise of 140 μV rms for a full-scale readout of 923 mV. On-chip online correction of the error between the two switchable gains keeps the nonlinearity of the output image within 0.18%. The 16-channel output interfaces at 4.752 Gbps/ch are implemented in two diced logic chips stacked on the sensor chip with 38K micro-bumps.

Category Imaging & Sensing
Name Y. Oike(Sony Semiconductor Solutions Corporation)
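
A numeric toy model of the gain-adaptive idea (all voltages, gains, and the switch point below are invented; the real correction is on-chip circuitry, not this code): digitize small signals with high gain and large ones with low gain, then re-scale the low-gain code with calibrated correction factors so the stitched transfer curve stays linear.

def ideal_adc(signal_mv, gain, lsb_mv=0.1):
    return round(signal_mv * gain / lsb_mv)      # ideal quantizer for brevity

GAIN_HI, GAIN_LO = 4.0, 1.0
SWITCH_MV = 200.0                        # use the high-gain path below this
RATIO, OFFSET = GAIN_HI / GAIN_LO, 0.0   # calibrated inter-gain correction

def read(signal_mv):
    if signal_mv < SWITCH_MV:
        return ideal_adc(signal_mv, GAIN_HI)
    return round(ideal_adc(signal_mv, GAIN_LO) * RATIO + OFFSET)

for mv in (150.0, 199.9, 200.1, 800.0):
    print(f"{mv:6.1f} mV -> code {read(mv)}")   # codes cross 200 mV smoothly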

[Academic Conference]VLSI Symposia on Technology and Circuits(VLSI)
[Theme]Accelerating the Sensing World through Imaging Evolution

The evolution of CMOS image sensors (CIS) and the future prospect of a “Sensing” world utilizing advanced imaging technologies promise to improve our quality of life by sensing everything, everywhere, every time. Charge-coupled device image sensors replaced video camera tubes, allowing the introduction of compact video cameras as consumer products. CIS now dominates the market for digital still cameras created by its predecessor and, with the advent of column-parallel ADCs and back-illuminated technologies, outperforms them: CISs achieve better signal-to-noise ratio, lower power consumption, and higher frame rates. Stacked CISs continue to enhance functionality and user experience in mobile devices, a market that currently comprises several billion units per year. CIS imaging technologies promise to accelerate the progress of a sensing world by continuously improving sensitivity, extending detectable wavelengths, and further improving depth and temporal resolution.

Category Imaging & Sensing
Name T. Nomoto(Sony Semiconductor Solutions Corporation)

[Academic Conference]VLSI Symposia on Technology and Circuits(VLSI)
[Theme]320x240 Back-Illuminated 10μm CAPD Pixels for High Speed Modulation Time-of-Flight CMOS Image Sensor

A 320×240 back-illuminated time-of-flight CMOS image sensor with 10 μm CAPD pixels has been developed. The back-illuminated (BI) pixel structure maximizes the fill factor, allows flexible transistor placement, and makes the light path independent of the metal layers. In addition, the CAPD pixel, which is optimized for high-speed modulation, achieves 80% modulation contrast at a 100 MHz modulation frequency.

Category Imaging & Sensing
Name Y. Kato(Sony Semiconductor Solutions Corporation)
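
For context, the standard 4-phase indirect-ToF demodulation that CAPD pixels support (a textbook formulation, not necessarily this sensor's exact pipeline) converts four correlation samples into a phase delay and then a distance; at 100 MHz modulation the unambiguous range is c/(2f), about 1.5 m.

import math

C = 299_792_458.0     # speed of light, m/s
F_MOD = 100e6         # modulation frequency from the abstract

def tof_depth(a0, a90, a180, a270):
    """Textbook 4-phase demodulation: samples at 0/90/180/270 degrees."""
    phase = math.atan2(a90 - a270, a0 - a180) % (2.0 * math.pi)
    return C * phase / (4.0 * math.pi * F_MOD)

# Synthetic samples for a return delayed by 120 degrees of modulation phase:
ph = math.radians(120.0)
samples = [math.cos(ph - s) for s in (0.0, math.pi / 2, math.pi, 1.5 * math.pi)]
print(f"depth = {tof_depth(*samples):.3f} m")   # ~0.500 m, one third of 1.5 m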

[Academic Conference]VLSI Symposia on Technology and Circuits(VLSI)
[Theme]224-ke Saturation Signal Global Shutter CMOS Image Sensor with In-Pixel Pinned Storage and Lateral Overflow Integration Capacitor

The required incorporation of an additional in-pixel retention node means that achieving a large saturation signal is a challenge for global-shutter complementary metal-oxide semiconductor (CMOS) image sensors. This paper reports a 3.875-μm-pixel single-exposure global-shutter CMOS image sensor with in-pixel pinned storage (PST) and a lateral overflow integration capacitor (LOFIC), which extends the saturation signal to 224 ke, bringing the saturation signal per unit area to 14.9 ke/μm². Owing to the PST, this pixel can secure a large saturation signal by using the LOFIC for accumulation without degrading image quality under dark and low-illuminance conditions.

Category Imaging & Sensing
Name Y. Sakano(Sony Semiconductor Solutions Corporation)
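
A hedged sketch of how such a dual-path pixel is typically linearized (our reconstruction with an invented full well and gain ratio, not the paper's actual readout chain): use the low-noise pinned-storage signal until it saturates, then switch to the LOFIC signal re-scaled by the conversion-gain ratio.

PST_FULL_WELL_E = 10_000   # charge the low-noise path can hold (assumed)
GAIN_RATIO = 22.4          # PST-to-LOFIC conversion-gain ratio (assumed)

def combine(pst_signal_e, lofic_code):
    """Stitch the two readings into one linear extended-range value."""
    if pst_signal_e < PST_FULL_WELL_E:
        return pst_signal_e               # dark/low light: low-noise path
    return lofic_code * GAIN_RATIO        # bright: re-scaled LOFIC path

print(combine(4_000, 179))        # -> 4000 e- from the low-noise path
print(combine(10_000, 10_000))    # -> 224,000 e-: the extended saturation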

[Academic Conference]VLSI Symposia on Technology and Circuits(VLSI)
[Theme]A 4.1Mpix 280fps Stacked CMOS Image Sensor with Array-Parallel ADC Architecture for Region Control

A 4.1Mpix 280fps stacked CMOS image sensor with an array-parallel ADC architecture is developed for region-control applications. The combination of an active reset scheme and frame correlated double sampling (CDS) cancels the Vth variation of the pixel amplifier transistors and kTC noise. The sensor utilizes a floating-diffusion (FD) based back-illuminated (BI) global-shutter (GS) pixel with 4.2 e- rms readout noise. An intelligent sensor system with face detection and high-resolution region-of-interest (ROI) output is demonstrated, achieving significantly lower data bandwidth and ADC power dissipation by utilizing a flexible area-access function.

Category Imaging & Sensing
Name T. Takahashi(Sony Semiconductor Solutions Corporation)
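
A sketch of the region-control idea (our simplification; detect_face() is a stand-in, and the frame geometry and scale factor are assumed): run detection on a cheap low-resolution pass, then read only the detected region at full resolution, which is where the bandwidth saving comes from.

import numpy as np

def detect_face(preview):
    # Placeholder detector: returns a fixed box (y, x, h, w) in preview coords.
    return (10, 16, 8, 8)

def roi_readout(sensor_frame, preview_scale=8):
    preview = sensor_frame[::preview_scale, ::preview_scale]   # low-res pass
    y, x, h, w = detect_face(preview)
    s = preview_scale
    roi = sensor_frame[y*s:(y+h)*s, x*s:(x+w)*s]               # full-res ROI
    saving = 1.0 - roi.size / sensor_frame.size
    return roi, saving

frame = np.zeros((2176, 1920), dtype=np.uint16)    # ~4.1 Mpix, assumed shape
roi, saving = roi_readout(frame)
print(roi.shape, f"bandwidth saved: {saving:.1%}")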

[Academic Conference]VLSI Symposia on Technology and Circuits(VLSI)
[Theme]3D integration technology for CMOS image sensors and future prospects

Category Imaging & Sensing
Name R. Nakamura(Sony Semiconductor Solutions Corporation)

[Academic Conference]International Solid-State Circuits Conference(ISSCC)
[Theme]A 0.7V 1.5-to-2.3mW GNSS Receiver with 2.5-to-3.8dB NF in 28nm FD-SOI

We are approaching the age of IoE, in which wearable devices such as smartwatches will be widespread. Sensing processors play a key role, and the Global Navigation Satellite System (GNSS) is considered fundamental. Power consumption is one of the most important characteristics of such sensing processors. However, current GNSS receivers consume around 10mW [1,2] and are difficult to embed. GNSS receivers require a high supply voltage for low-noise RF, which contributes to large power consumption. We developed 0.7V RF circuits that make effective use of FD-SOI. Among the RF circuits, an LNA and an LPF are the keys to 0.7V operation. We implemented an LNA with DC feedback using an OPAMP, and an LPF composed of OTAs that have positive feedback as well as a mechanism for adjusting the output common-mode voltage.

Category System Architecture & Processor
Name K. Yamamoto(Sony Semiconductor Solutions Corporation)

[Academic Conference]International Solid-State Circuits Conference(ISSCC)
[Theme]A 12Gb/s 0.9mW/Gb/s Wide-Bandwidth Injection-Type CDR in 28nm CMOS with Reference-Free Frequency Capture

The consumer electronics market demands high-speed, low-power serial data interfaces. The injection-locked oscillator (ILO) based clock and data recovery (CDR) circuit [1-2] is a well-known solution for these demands. The typical solution has at least two oscillators: a master and one or more slaves. The master, a replica of the data-path ILO, is part of a phase-locked loop (PLL) used to correct the oscillator free-running frequency (FRF). The slave ILO phase-locks to the incoming data but uses the frequency control from the master. Any FRF difference between the master and slave, such as that caused by PVT variation or mismatch, reduces receiver performance.

One solution [3] uses burst data and corrects the FRF between bursts. However, for continuous data, injection forces the recovered clock frequency to match the incoming data rate, masking any FRF error from the frequency detector. Existing solutions [4-5] use a phase detector (PD) to measure the FRF, but any static phase offset between the PD lock point and the ILO lock point, caused by mismatch, PVT variation, or layout, makes the frequency control algorithm converge incorrectly.

This paper describes an ILO-type CDR, called the frequency-capturing ILO (FCILO), that eliminates the master oscillator and combines the ILO and PLL [6] type CDRs, realizing the benefits of both: the ILO gives wide bandwidth and fast locking, while the PLL gives a wide frequency capture range. The CDR architecture, shown in Fig. 10.4.2, has a half-rate ILO, data and edge samplers forming a bang-bang phase detector (BBPD), two 2:10 demuxes, and independent digital phase and frequency control. The ILO is made from current-starved inverters and driven by an edge detector, and has coarse and fine frequency tuning: the strength of the oscillator's unit inverter is adjusted for coarse tuning, keeping the normalized gain and delay constant over a wide range of frequencies, while a current DAC is used for fine tuning. The edge detector shorts the ILO differential nodes together to align clock and data transitions. The BBPD outputs are used by the digital phase and frequency control to determine whether ILO edges are early or late with respect to the incoming data and to correct the ILO FRF. A variable delay circuit controls the timing between the data and clock inputs to the BBPD, correcting the static phase offset between the PD and ILO lock points.

Category System Architecture & Processor
Name T. Masuda(Sony Semiconductor Solutions Corporation)
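
A behavioral toy loop (our simplification, not the FCILO circuit; the gains and error magnitude are invented) showing the division of labor the abstract describes: 1-bit early/late decisions from a bang-bang phase detector drive a fast phase correction and a slow frequency correction, so the recovered clock captures the data rate without a reference.

data_rate = 1.0            # normalized incoming data rate
clk_freq = 0.99            # oscillator free-running frequency, 1% off
phase = 0.0                # phase error in unit intervals
KP, KF = 0.01, 0.0005      # proportional (phase) / integral (freq) gains

for _ in range(20_000):
    phase += clk_freq - data_rate           # freq error accumulates as phase
    early_late = 1 if phase > 0.0 else -1   # 1-bit BBPD decision
    phase -= KP * early_late                # fast path: nudge the phase
    clk_freq -= KF * early_late             # slow path: nudge the frequency

print(f"captured frequency: {clk_freq:.4f}")  # dithers tightly around 1.0000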

[Academic Conference]International Solid-State Circuits Conference(ISSCC)
[Theme]A 1ms High-Speed Vision Chip with 3D-Stacked 140GOPS Column-Parallel PEs for Spatio-Temporal Image Processing

High-speed vision systems that combine high-frame-rate imaging and highly parallel signal processing enable instantaneous visual feedback for controlling machines at speeds beyond human visual recognition. Such systems also enable a reduction in circuit scale by using fast, simple algorithms optimized for high-frame-rate processing. Previous studies on vision systems and chips [1-4] have suffered from low imaging performance due to large matrix-based processing element (PE) parallelization [1-3], or from the limited functionality of special-purpose column-parallel PE architectures [4], constraining vision-chip applications.

Category Imaging & Sensing
Name T. Yamazaki(Sony Semiconductor Solutions Corporation)

[Academic Conference]International Solid-State Circuits Conference(ISSCC)
[Theme]A 1/2.3in 20Mpixel 3-Layer Stacked CMOS Image Sensor with DRAM

In recent years, the performance of cellphone cameras has improved and is becoming comparable to that of SLR cameras. However, a big difference between cellphone cameras and SLR cameras is the distortion caused by the rolling exposure of CMOS image sensors (CISs), because cellphone cameras cannot include a mechanical shutter. In addition to this technical problem, demands for high quality in dark scenes and for movies are increasing. Frame-level signal processing can solve these problems, but previous generations of CIS could not achieve both high-speed readout and an accessible I/F speed. This paper presents a 3-layer stacked back-illuminated CMOS image sensor (3L-BI-CIS) with DRAM mounted as a frame memory.

Category Imaging & Sensing
Name T. Haruta(Sony Semiconductor Solutions Corporation)

[Academic Conference]International Solid-State Circuits Conference(ISSCC)
[Theme]A 1/4-inch 3.9Mpixel Low-Power Event-Driven Back-Illuminated Stacked CMOS Image Sensor

Wireless products such as smart home-security cameras, intelligent agents, and virtual personal assistants are evolving rapidly to satisfy our needs. Small size, extended battery life, and transparent machine interfaces are all required of the camera systems in these applications. In battery-limited environments, these applications can profit from an event-driven approach to moving-object detection. This paper presents a 1/4-inch 3.9Mpixel low-power event-driven (ED) back-illuminated stacked CMOS image sensor (CIS) with a pixel readout circuit that detects moving objects for each pixel under lighting conditions ranging from 1 to 64,000 lux. Utilizing pixel summation in a shared floating diffusion (FD) for each pixel block, moving-object detection is realized at 10 frames per second while consuming only 1.1mW, a 99% reduction in power compared with the same CIS running at full resolution and 60fps (95mW).

Category Imaging & Sensing
Name O. Kumagai(Sony Semiconductor Solutions Corporation)
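
A digital caricature of the event-driven mode (our sketch; the real sensor sums charge in shared floating diffusions, and the block size and threshold here are invented): compare coarse block sums between frames and wake full-resolution capture only when a block changes.

import numpy as np

BLOCK = 16         # binning factor (invented)
THRESH = 8.0       # mean per-pixel change that counts as an event (invented)

def block_sums(frame, b=BLOCK):
    h, w = frame.shape
    return frame.reshape(h // b, b, w // b, b).sum(axis=(1, 3))

def motion_detected(prev, cur):
    diff = np.abs(block_sums(cur) - block_sums(prev))
    return bool((diff > THRESH * BLOCK * BLOCK).any())

rng = np.random.default_rng(1)
still = rng.integers(0, 64, (480, 640)).astype(np.int64)
moved = still.copy()
moved[100:200, 300:400] += 50                 # an object enters one region
print(motion_detected(still, still))          # False: stay in low-power mode
print(motion_detected(still, moved))          # True: wake full-res capture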

[Academic Conference]International Solid-State Circuits Conference(ISSCC)
[Theme]A Back-Illuminated Global-Shutter CMOS Image Sensor with Pixel-Parallel 14b Subthreshold ADC

Rolling-shutter CMOS image sensors (CISs) are widely used [1,2]. However, the distortion of moving subjects remains an unresolved problem, regardless of the speed at which these sensors are operated. It has been reported that by adopting in-pixel analog memory (MEM), a global shutter (GS) can be achieved by saving all pixels simultaneously as stored charges [3,4]. However, as signals from the storage unit are read in a column-wise sequence, a light-shielding structure is required for the MEM to suppress the influence of parasitic light during the reading period. Pixel-parallel ADCs have been reported as a circuit-level means of implementing a GS [5,6]. However, these techniques have not scaled to megapixel operation, because they do not address the timing constraints on reading and writing digital signals to and from an in-pixel ADC as the number of pixels grows, or the increase in the total power consumption of the massively parallel comparators (CMs).

Category Imaging & Sensing
Name M. Sakakibara(Sony Semiconductor Solutions Corporation)

[Academic Conference]International Solid-State Circuits Conference(ISSCC)
[Theme]Compressive Imaging for CMOS Image Sensors

Category Imaging & Sensing
Name Y. Oike(Sony Semiconductor Solutions Corporation)

[Academic Conference]2017 Primetime Emmy Engineering Award
[Theme]Development of the High Efficiency Video Coding (HEVC) standard
[Commendation institution]The Academy of Television Arts & Sciences

The development of High Efficiency Video Coding (HEVC) has enabled the efficient delivery of ultra-high-definition (UHD) content over multiple distribution channels. This new compression standard has been adopted, or selected for adoption, by all UHD television distribution channels, including terrestrial, satellite, cable, fiber, and wireless, as well as all UHD viewing devices, including traditional televisions, tablets, and mobile phones.

Category Display & Visual
Name Teruhiko Suzuki
