Introduction to supervised sequence labelling with RNN

第1段落

(高橋くんの訳に基づく)

In machine learning, the term sequence labelling encompasses all tasks where sequences of data are transcribed with sequences of discrete labels.
機械学習において、シーケンス・ラベリング(系列ラベリング)という用語は、データのシーケンスを離散ラベルのシーケンスに書き起こすタスクすべてを包含する。
Well-known examples include speech and handwriting recognition, protein secondary structure prediction and part-of-speech tagging.
よく知られた例には、音声認識や手書き文字認識、タンパク質二次構造予測、品詞のタグ付けがある。
Supervised sequence labelling refers specifically to those cases where a set of hand-transcribed sequences is provided for algorithm training.
教師ありシーケンス・ラベリングとは、特に、人手で書き起こした(ラベル付けした)シーケンスの集合がアルゴリズムの学習用に与えられる場合を指す。
What distinguishes such problems from the traditional framework of supervised pattern classification is that the individual data points cannot be assumed to be independent.
伝統的な枠組みの教師ありパタン認識とこのような問題との違いは、個々のデータ点が互いに独立であるとは仮定できないことにある。
Instead, both the inputs and the labels form strongly correlated sequences.
(独立と仮定できないどころか)入力もラベルも強い相関があるシーケンスである。
In speech recognition for example, the input (a speech signal) is produced by the continuous motion of the vocal tract, while the labels (a sequence of words) are mutually constrained by the laws of syntax and grammar.
例えば音声認識では、入力(音声信号)は声道の連続的な運動によって生成され、ラベル(単語列)は構文構造や文法規則によって相互に制約される。
A further complication is that in many cases the alignment between inputs and labels is unknown.
さらに複雑なことに、入力とラベルの対応関係が不明なことが多い。
This requires the use of algorithms able to determine the location as well as the identity of the output labels.
このことにより、出力ラベルの値だけではなく位置をも決定できるようなアルゴリズムの使用が必要になる。

第2段落

(寺西くんの訳に基づく)

Recurrent neural networks (RNNs) are a class of artificial neural network architecture that --- inspired by the cyclical connectivity of neurons in the brain --- uses iterative function loops to store information.
リカレント(再帰)ニューラルネットワーク(RNN)は、実際の脳におけるニューロンの循環的な結合構造にヒントを得て、反復関数ループを用いて情報を格納するという種類の人工ニューラルネットワーク・アーキテクチャである。
RNNs have several properties that make them an attractive choice for sequence labelling: they are flexible in their use of context information (because they can learn what to store and what to ignore); they accept many different types and representations of data; and they can recognise sequential patterns in the presence of sequential distortions.
RNNにはシーケンス・ラベリングにとって魅力的な性質がいくつかある: 文脈情報の利用において柔軟性がある(何を記憶し何を無視するかを学習できるため)、さまざまな種類や表現のデータを受け入れる、そして系列方向の歪みがあっても系列パタンを認識することができる。
However they also have several drawbacks that have limited their application to real-world sequence labelling problems.
しかしながら、RNNは実世界のシーケンス・ラベリング問題への適用を制限してきたいくつかの欠点も持ち合わせている。
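
The "iterative function loop" described above can be sketched as a single recurrent unit. This is a minimal illustration, not the formulation used in the book: the function name `rnn_forward`, the scalar weights `w_in` and `w_rec`, and the `tanh` activation are all assumptions chosen for brevity.

```python
import math

def rnn_forward(inputs, w_in=0.5, w_rec=0.9, h0=0.0):
    """Run a one-unit recurrent network over a sequence of floats.

    The same update is applied at every step, and the previous hidden
    state h is fed back into it. That feedback loop is what lets the
    unit carry information forward through the sequence.
    (w_in, w_rec and tanh are illustrative assumptions.)
    """
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

# A single non-zero input followed by zeros: the hidden state stays
# non-zero afterwards, so the first input is still "remembered".
states = rnn_forward([1.0, 0.0, 0.0, 0.0])
```

With these toy weights the stored value shrinks at every step, which gives a small-scale picture of why holding information over long spans is hard for a standard RNN.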

第3段落

(寺西くんの訳に基づく)

Perhaps the most serious flaw of standard RNNs is that it is very difficult to get them to store information for long periods of time (Hochreiter et al., 2001b).
おそらく標準的なRNNのもっとも深刻な欠点は、長時間にわたって情報を保管することが非常に難しいということである(Hochreiter et al., 2001b) 。
This limits the range of context they can access, which is of critical importance to sequence labelling.
これによりアクセス可能な文脈の範囲が制限されるが、文脈の範囲はシーケンス・ラベリングにとって決定的に重要である。
Long Short-Term Memory (LSTM; Hochreiter and Schmidhuber, 1997) is a redesign of the RNN architecture around special `memory cell' units.
長短期記憶(LSTM; Hochreiter and Schmidhuber, 1997)は、特殊な「メモリセル」ユニットを中心にRNNアーキテクチャを再設計したものである。
In various synthetic tasks, LSTM has been shown capable of storing and accessing information over very long timespans (Gers et al., 2002; Gers and Schmidhuber, 2001).
さまざまな人工的な(合成)タスクにおいて、LSTMは非常に長い期間にわたって情報を格納しアクセスできることが示されている (Gers et al., 2002; Gers and Schmidhuber, 2001)。
It has also proved advantageous in real-world domains such as speech processing (Graves and Schmidhuber, 2005b) and bioinformatics (Hochreiter et al., 2007).
LSTMはまた、音声処理(Graves and Schmidhuber, 2005b)や生物情報学(Hochreiter et al., 2007)のような現実の問題においても利点があることを示してきた。
LSTM is therefore the architecture of choice throughout the book.
それゆえ本書ではアーキテクチャとしてLSTMを採用する。

第4段落

(田原くんの訳に基づく)

Another issue with the standard RNN architecture is that it can only access contextual information in one direction (typically the past, if the sequence is temporal).
標準的なRNNアーキテクチャのもうひとつの問題は、一方向の文脈情報にしかアクセスできないということである (シーケンスが時系列の場合、典型的には過去の文脈のみ)。
This makes perfect sense for time-series prediction, but for sequence labelling it is usually advantageous to exploit the context on both sides of the labels.
このことは時系列予測の場合は全く理にかなっているが、シーケンス・ラベリングの場合は、ラベルの両側の文脈を利用できるほうが通常は有利である。
Bidirectional RNNs (Schuster and Paliwal, 1997) scan the data forwards and backwards with two separate recurrent layers, thereby removing the asymmetry between input directions and providing access to all surrounding context.
双方向RNN(Schuster and Paliwal, 1997)では２つの別々の再帰層を用いて前向きと後ろ向きのそれぞれでデータを走査し、それにより入力方向における非対称性を除去しつつ、周囲の文脈すべてへのアクセスを提供している。
Bidirectional LSTM (Graves and Schmidhuber, 2005b) combines the benefits of long-range memory and bidirectional processing.
双方向LSTM(Graves & Schmidhuber, 2005b) は長期記憶と双方向処理という利点を組み合わせたものである。
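
The two-layer scan described above can be sketched with toy one-unit layers. This is an illustrative assumption rather than the architecture of Schuster and Paliwal: the helper names `scan` and `bidirectional_scan`, the scalar weights, and the `tanh` activation are all hypothetical.

```python
import math

def scan(inputs, w_in, w_rec):
    """One-unit recurrent layer: returns the hidden state at each step."""
    h = 0.0
    out = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        out.append(h)
    return out

def bidirectional_scan(inputs):
    """Pair each position with a forward state and a backward state.

    The forward layer reads left to right. The backward layer, which
    has its own (hypothetical) weights, reads the reversed sequence;
    its outputs are flipped back so that index t lines up in both.
    """
    forward = scan(inputs, w_in=0.5, w_rec=0.9)
    backward = scan(inputs[::-1], w_in=0.4, w_rec=0.8)[::-1]
    return list(zip(forward, backward))

# The only non-zero input sits in the middle of the sequence.
features = bidirectional_scan([0.0, 0.0, 1.0, 0.0, 0.0])
```

At the first position the forward state is still zero, but the backward state already reflects the later input; at the last position the roles are reversed. Taken together, every position has access to context from both sides.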

第5段落

(山田真司くんの訳に基づく)

For tasks such as speech recognition, where the alignment between the inputs and the labels is unknown, RNNs have so far been limited to an auxiliary role.
入力とラベルとの間の位置合わせが未知である音声認識のようなタスクの場合、RNNの役割はこれまで補助的なものに限定されていた。
The problem is that the standard training methods require a separate target for every input, which is usually not available.
問題は、標準的な学習方法では、すべての入力に対しそれぞれ個別の目標値が必要であり、普通ならそのような目標値が入手できないということである。
The traditional solution --- the so-called hybrid approach --- is to use hidden Markov models to generate targets for the RNN, then invert the RNN outputs to provide observation probabilities (Bourlard and Morgan, 1994).
伝統的な解決方法、いわゆるハイブリッドアプローチは、隠れマルコフモデル(HMM)を使用してRNNの目標値を生成し、次にRNNの出力を反転して観測確率を得るというものである(Bourlard and Morgan, 1994)。
However the hybrid approach does not exploit the full potential of RNNs for sequence processing, and it also leads to an awkward combination of discriminative and generative training.
しかしながら、ハイブリッドアプローチはシーケンス処理におけるRNNの可能性を十分に活用しておらず、また識別的学習と生成的学習のぎこちない組み合わせをもたらす。
The connectionist temporal classification (CTC) output layer (Graves et al., 2006) removes the need for hidden Markov models by directly training RNNs to label sequences with unknown alignments, using a single discriminative loss function.
コネクショニスト時間分類(CTC)の出力層(Graves et al., 2006)は、単一の識別的損失関数を用いて、アラインメントが未知のシーケンスにラベル付けするようRNNを直接学習させることにより、隠れマルコフモデルを不要にする。
CTC can also be combined with probabilistic language models for word-level speech and handwriting recognition.
CTCはまた単語レベルの音声認識や手書き文字認識のための確率的言語モデルと組み合わせることもできる。
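
One way to see how an output layer can cope with unknown alignments is the many-to-one mapping used in the CTC formulation of Graves et al. (2006): a per-frame label path is reduced to a label sequence by merging repeated labels and then deleting a special blank label. That rule is not spelled out in the text above, so the sketch below should be read as background. It shows only this mapping, not the loss function, and the `BLANK` symbol and the function name `collapse` are hypothetical choices.

```python
from itertools import groupby

BLANK = "-"  # hypothetical symbol standing in for the blank label

def collapse(path):
    """Map a per-frame label path to its label sequence.

    Step 1: merge runs of identical consecutive labels.
    Step 2: drop the blank label.
    """
    merged = [label for label, _ in groupby(path)]
    return [label for label in merged if label != BLANK]

# Two different frame-level alignments of the same labelling:
same_a = collapse("cc-aa-t")
same_b = collapse("-c-a-tt")

# A blank between repeated labels keeps them distinct:
doubled = collapse("to-o")
```

Because many frame-level paths collapse to the same labelling, the training targets do not need to say where in the input each label occurs.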

第6段落

(前田くんの訳に基づく)

Recurrent neural networks were designed for one-dimensional sequences.
リカレント(再帰)ニューラルネットワークは１次元シーケンス用に設計されたものである。
However some of their properties, such as robustness to warping and flexible use of context, are also desirable in multidimensional domains like image and video processing.
しかしながら、ワーピング(歪み)に対する頑健性や文脈の柔軟な使用のようないくつかの特性は画像処理や動画処理のような多次元領域においても望ましいものである。
Multidimensional RNNs, a special case of directed acyclic graph RNNs (Baldi and Pollastri, 2003), generalise to multidimensional data by replacing the one-dimensional chain of network updates with an n-dimensional grid.
有向非循環グラフ(directed acyclic graph)RNN(Baldi and Pollastri, 2003)の特殊なケースである多次元RNNは、ネットワーク更新の1次元の連鎖をn次元のグリッドに置き換えることにより、多次元データへと一般化している。
Multidimensional LSTM (Graves et al., 2007) brings the improved memory of LSTM to multidimensional networks.
多次元LSTM(Graves et al., 2007)は、LSTMの改良された記憶能力を多次元ネットワークにもたらすものである。
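
Replacing the one-dimensional chain of updates with a grid can be sketched for the two-dimensional case. This is a toy, single-unit illustration under assumed names and weights (`rnn_2d`, `w_in`, `w_up`, `w_left`), with one fixed scan direction starting at the top-left corner; it is not the formulation used in the cited work.

```python
import math

def rnn_2d(grid, w_in=0.5, w_up=0.4, w_left=0.4):
    """One-unit 2D recurrent scan over a grid of floats.

    The hidden state at (i, j) depends on the input at (i, j) and on
    the hidden states of the neighbours above and to the left, so the
    1D chain of updates becomes a 2D grid of updates.
    (All weights and the tanh activation are illustrative assumptions.)
    """
    rows, cols = len(grid), len(grid[0])
    h = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            up = h[i - 1][j] if i > 0 else 0.0
            left = h[i][j - 1] if j > 0 else 0.0
            h[i][j] = math.tanh(w_in * grid[i][j] + w_up * up + w_left * left)
    return h

# A single active pixel in the top-left corner influences every
# later position in the scan order.
image = [[1.0, 0.0, 0.0],
         [0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0]]
hidden = rnn_2d(image)
```

With a single scan direction, each position only sees context above and to its left; covering all surrounding context would need additional scans from the other corners, in the same spirit as the bidirectional case.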

第7段落

(山田翔くんの訳に基づく)

Even with the LSTM architecture, RNNs tend to struggle with very long data sequences.
LSTMアーキテクチャであっても、RNNは非常に長いシーケンスに対しては苦労する傾向がある。
As well as placing increased demands on the network's memory, such sequences can be prohibitively time-consuming to process.
このようなシーケンスは、ネットワークの記憶能力に対する要求を増大させるだけでなく、処理に実用にならないほど時間がかかることもある。
The problem is especially acute for multidimensional data such as images or videos, where the volume of input information can be enormous.
この問題は特に、入力情報の量が膨大になる画像やビデオなどの多次元データにとって深刻になる。
Hierarchical subsampling RNNs (Graves and Schmidhuber, 2009) contain a stack of recurrent network layers with progressively lower spatiotemporal resolution.
階層的サブサンプリングRNN(Graves and Schmidhuber, 2009)は、時空間解像度が段階的に低くなる再帰ネットワーク層を積み重ねた構造を持つ。
As long as the reduction in resolution is large enough, and the layers at the bottom of the hierarchy are small enough, this approach can be made computationally efficient for almost any size of sequence.
解像度の低下が十分に大きく、階層の最下部の層が十分に小さい限り、この手法によりほぼあらゆるサイズのシーケンスに対して計算効率を良くすることができる。
Furthermore, because the effective distance between the inputs decreases as the information moves up the hierarchy, the network's memory requirements are reduced.
さらに、情報が上の階層に伝播するにつれて、入力間の実効的な距離が減少するため、ネットワークの記憶容量に対する要求が減少する。
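
The effect of progressively lowering the resolution can be checked with simple arithmetic. The sketch assumes that each layer subsamples its input in non-overlapping windows of a fixed size; the function name `layer_lengths` and the window sizes are hypothetical and are not taken from the text.

```python
def layer_lengths(input_length, window_sizes):
    """Sequence length seen by each layer of a subsampling hierarchy.

    Assumes each layer collapses non-overlapping windows of the given
    size into one step, rounding up when the length does not divide
    evenly.
    """
    lengths = [input_length]
    for window in window_sizes:
        lengths.append(-(-lengths[-1] // window))  # ceiling division
    return lengths

# A 10,000-step input passed through three layers with window size 4.
lengths = layer_lengths(10_000, [4, 4, 4])
# lengths == [10000, 2500, 625, 157]
```

Under these assumptions the top layer processes 157 steps instead of 10,000, and two inputs that were 64 steps apart at the bottom are only one step apart at the top, which is the reduction in effective distance mentioned above.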

第8段落

(若松くんの訳に基づく)

The combination of multidimensional LSTM, CTC output layers and hierarchical subsampling leads to a general-purpose sequence labelling system entirely constructed out of recurrent neural networks.
多次元LSTM、CTC出力層、階層的サブサンプリングを組み合わせることで、再帰ニューラルネットワーク(RNN)のみで構成された汎用のシーケンス・ラベリングシステムが得られる。
The system is flexible, and can be applied with minimal adaptation to a wide range of data and tasks.
このシステムは柔軟であり、幅広いデータやタスクに対して最小限の調整で適用できる。
It is also powerful, as this book will demonstrate with state-of-the-art results in speech and handwriting recognition.
また、本書が音声認識と手書き文字認識における最先端の結果によって示すように、このシステムは強力でもある。

1.1 本書の構成

(小板くんの訳に基づく)

The chapters are roughly grouped into three parts: background material is presented in Chapters 2–4, Chapters 5 and 6 are primarily experimental, and new methods are introduced in Chapters 7–9.
章は大まかに３つの部分にわけられている:　第２章から４章までは背景となる基礎事項を示し、第５章と６章は主に実験を扱い、第７章から９章では新しい手法を紹介する。
Chapter 2 briefly reviews supervised learning in general, and pattern classification in particular.
第２章では広く教師あり学習について、そして特にパタンクラス分類について概説する。
It also provides a formal definition of sequence labelling, and discusses three classes of sequence labelling task that arise under different relationships between the input and label sequences.
また「シーケンスラベル付け」の形式的定義を与え、入力シーケンスとラベル・シーケンスの関係から生じる３種類のシーケンスラベル付問題について議論する。
Chapter 3 provides background material for feedforward and recurrent neural networks, with emphasis on their application to labelling and classification tasks.
第３章ではフィードフォワード(順伝播)とリカレント(再帰)ニューラルネットワークについての基礎知識を提示し、ラベル付けや分類問題への応用に焦点をあてる。
It also introduces the sequential Jacobian as a tool for analysing the use of context by RNNs.
またRNNによる文脈の使用を分析するためのツールとしてシーケンシャル・ヤコビアン(sequential Jacobian)を紹介する。
Chapter 4 describes the LSTM architecture and introduces bidirectional LSTM (BLSTM).
第４章ではLSTMアーキテクチャを説明し、双方向のLSTM(BLSTM)を紹介する。
Chapter 5 contains an experimental comparison of BLSTM to other neural network architectures applied to framewise phoneme classification.
第５章では、フレーム単位の音素分類に適用したBLSTMと他のニューラルネットワーク・アーキテクチャとを実験的に比較する。
Chapter 6 investigates the use of LSTM in hidden Markov model-neural network hybrids.
第６章では隠れマルコフモデルとニューラルネットワークのハイブリッドにおけるLSTMの使用法を調べる。
Chapter 7 introduces connectionist temporal classification, Chapter 8 covers multidimensional networks, and hierarchical subsampling networks are described in Chapter 9.
第７章ではCTC(コネクショニスト時間分類)を紹介し、第８章では多次元ネットワークを取り上げ、第９章で階層的サブサンプリングネットワークについて説明する。