English translation: 4.3 Classification. Sample Classification. Often, data sets consist of samples that belong to several different groups or “classes.” Different classes may be the result of samples prepared with different raw material (possibly from differen

Source: 学生作业帮助网  Editor: 六六作业网  Time: 2024/12/24 00:21:03
English translation
4.3 Classification
Sample Classification
Often, data sets consist of samples that belong to several different groups or “classes.” Different classes may be the result of samples prepared with different raw material (possibly from different vendors), class of chemical compound (aromatic, aliphatic, carbonyl, etc.), or process state (startup, normal, particular faults, etc.). A variety of methods have been developed for classifying samples based on measured responses.
Methods that attempt to identify groups or classes without using pre-established class memberships are known as cluster analysis or unsupervised pattern recognition. Methods that use known class memberships are generally called classification or supervised pattern recognition. As we will demonstrate, classification of new unknown samples can be accomplished manually using unsupervised methods or automated using supervised techniques.
Distance Measures in Cluster Analysis and Classification
Most cluster analysis methods are based on the assumption that samples that are close together in the measurement space are similar and therefore likely to belong to the same class. There are, however, many ways of defining distance between samples. The most common is simple Euclidean distance, where the distance $d_{ij}$ between samples $x_i$ and $x_j$ is defined as
$$d_{ij} = \sqrt{(x_i - x_j)(x_i - x_j)^T} \qquad (93)$$
which is simply the square root of the sum of squared differences between the samples.
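As a minimal sketch of equation 93, the following computes the Euclidean distance between two samples stored as plain Python lists. The function name `euclidean_distance` and the sample values are assumptions made for illustration; they do not come from the original text.

```python
import math

def euclidean_distance(x_i, x_j):
    """Equation 93: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

# Hypothetical samples, each measured on three variables.
x_i = [1.0, 2.0, 3.0]
x_j = [4.0, 6.0, 3.0]

# Squared differences are 9, 16 and 0, so the distance is sqrt(25).
d_ij = euclidean_distance(x_i, x_j)
print(d_ij)
```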
Preprocessing such as autoscaling is often used prior to calculating the distance. Distance may also be defined based on PCA scores instead of on the raw data. This is in essence a noise-filtered estimate of the distance, as the deviations between samples in dimensions not included in the model are not considered. In this case, the distance $d_{ij}$ between samples $x_i$ and $x_j$ with scores $t_i$ and $t_j$ (note that $t_i$ and $t_j$ are 1 by k vectors containing the scores on all PCs of the model for samples $x_i$ and $x_j$) is defined as
$$d_{ij} = \sqrt{(t_i - t_j)(t_i - t_j)^T} \qquad (94)$$
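The score-based distance of equation 94 can be sketched as follows. The loading vectors here are hypothetical stand-ins for the loadings of a previously fitted PCA model with k = 2 components, and the helper names `project_to_scores` and `score_distance` are assumptions for illustration only. The samples are assumed to be already preprocessed in the same way as the data used to build the model.

```python
import math

def project_to_scores(x, loadings):
    """Project a 1-by-n sample onto k loading vectors, giving a 1-by-k score vector."""
    return [sum(a * p for a, p in zip(x, pc)) for pc in loadings]

def score_distance(t_i, t_j):
    """Equation 94: Euclidean distance computed on the scores."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_i, t_j)))

# Hypothetical orthonormal loadings for a 2-PC model of 3 variables.
r = 1.0 / math.sqrt(2.0)
loadings = [[r, r, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical samples.
x_i = [1.0, 2.0, 3.0]
x_j = [4.0, 6.0, 3.0]

t_i = project_to_scores(x_i, loadings)
t_j = project_to_scores(x_j, loadings)

d_scores = score_distance(t_i, t_j)
d_raw = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))
print(d_scores, d_raw)
```

In this sketch the score distance is slightly smaller than the raw Euclidean distance, because the part of the difference that lies in the direction left out of the model is ignored.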
It is also possible to define the distance based on principal component scores adjusted to unit variance. This is analogous to the $T^2$ statistic given in equation 7. In this case the distance in equation 94 is weighted by the inverse of the eigenvalues, $\lambda$:
$$d_{ij} = \sqrt{(t_i - t_j)\,\lambda^{-1}\,(t_i - t_j)^T} \qquad (95)$$
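A minimal sketch of the weighted distance in equation 95 is shown below. It assumes the eigenvalues are supplied as a list with one entry per principal component, so that the weighting by the inverse eigenvalues reduces to dividing each squared score difference by its eigenvalue. The function name and all numeric values are hypothetical.

```python
import math

def weighted_score_distance(t_i, t_j, eigenvalues):
    """Equation 95: score differences weighted by the inverse eigenvalues."""
    return math.sqrt(
        sum((a - b) ** 2 / lam for a, b, lam in zip(t_i, t_j, eigenvalues))
    )

# Hypothetical scores on two PCs and the corresponding eigenvalues.
t_i = [2.0, 1.0]
t_j = [-2.0, 0.0]
eigenvalues = [4.0, 0.25]

# Weighted terms: 16 / 4 = 4 and 1 / 0.25 = 4, so the distance is sqrt(8).
d_weighted = weighted_score_distance(t_i, t_j, eigenvalues)

# Unweighted distance of equation 94 for comparison: sqrt(16 + 1).
d_unweighted = math.sqrt(sum((a - b) ** 2 for a, b in zip(t_i, t_j)))
print(d_weighted, d_unweighted)
```

With these values the two components contribute equally to the weighted distance, even though the raw score difference on the first component is four times larger.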
The distance measure defined in equation 95 is one type of Mahalanobis distance. A Mahalanobis distance accounts for the fact that, in many data sets, variation in some
