




FAQ

Maven

Repository

Dependency "org.apache.hadoop:hadoop-client:2.6.0-cdh5.8.2" not found

Author: 胖哥

Time: 2016-10-13 14:18:21

Scenario: after adding the dependency below, Maven reports that the artifact cannot be resolved.

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.0-cdh5.8.0</version>
</dependency>


Solution: add a repositories entry pointing at the Cloudera repo:

<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>

YARN

ResourceManager

error: <class 'xmlrpclib.Fault'>, <Fault 92: 'CANT_REREAD: The directory named as part of the path /var/run/cloudera-scm-agent/process/1543-yarn-RESOURCEMANAGER/logs/stdout.log does not exist.'>: file: /usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/supervisor-3.0-py2.6.egg/supervisor/supervisorctl.py line: 947

Author: 那伊抹微笑

Time: 2016-10-10 16:59:21

Scenario: the ResourceManager went down for an unknown reason, and restarting the role from Cloudera Manager failed. The role log contained this error:

java.lang.IllegalArgumentException: count cannot be negative: -2147483648
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:115)
        at com.google.common.collect.Multisets.checkNonnegative(Multisets.java:943)
        at com.google.common.collect.AbstractMapBasedMultiset.setCount(AbstractMapBasedMultiset.java:277)
        at com.google.common.collect.HashMultiset.setCount(HashMultiset.java:34)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.addSchedulingOpportunity(SchedulerApplicationAttempt.java:510)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:652)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:865)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:328)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:241)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1126)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1048)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:295)

However, the stack trace was no help.

Restarting the role from Cloudera Manager again, the role log stopped producing output entirely, which means the failure now happens before the role is even started.

So the next place to look was /var/log/cloudera-scm-agent/cloudera-scm-agent.out, and sure enough it contained an error:

error: <class 'xmlrpclib.Fault'>, <Fault 92: 'CANT_REREAD: The directory named as part of the path /var/run/cloudera-scm-agent/process/1543-yarn-RESOURCEMANAGER/logs/stdout.log does not exist.'>: file: /usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/supervisor-3.0-py2.6.egg/supervisor/supervisorctl.py line: 947

Baffling. Nothing on this server had been touched, so where did this come from? Setting that question aside and simply restarting again produced exactly the same error.

In that case, create the missing directory by hand:

# recreate the directory the agent expects, then open up its permissions
mkdir -p /var/run/cloudera-scm-agent/process/1543-yarn-RESOURCEMANAGER/logs/
chmod 777 /var/run/cloudera-scm-agent/process/1543-yarn-RESOURCEMANAGER/logs/

Restarting the ResourceManager role after that produced the same error again, only with a different directory in the path. Creating that directory too, and repeating the cycle a few more times, the role finally started and everything came back to normal.

Solution: create each directory the error complains about and grant it permissions; after a few repetitions the role starts and the service runs normally.
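The create-directory-and-retry cycle above can be semi-automated. The sketch below is an assumption, not part of the original fix: `missing_dir` is a made-up helper name, and it relies on the CANT_REREAD message keeping exactly the format shown above and on the agent log living at /var/log/cloudera-scm-agent/cloudera-scm-agent.out.

```shell
# Hypothetical helper (name and log format are assumptions): given one
# CANT_REREAD error line, print the directory that needs to be created.
missing_dir() {
  path=$(printf '%s\n' "$1" | sed -n 's/.*part of the path \(.*\) does not exist.*/\1/p')
  [ -n "$path" ] && dirname "$path"
}

# After each failed restart, feed the newest CANT_REREAD line to the helper,
# create whatever it reports, then retry the restart in Cloudera Manager:
#   line=$(grep CANT_REREAD /var/log/cloudera-scm-agent/cloudera-scm-agent.out | tail -1)
#   d=$(missing_dir "$line")
#   [ -n "$d" ] && mkdir -p "$d" && chmod 777 "$d"
```

This only saves the eyeballing step; each restart still has to be triggered from Cloudera Manager by hand.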


Spark

DataFrame, Dataset, Spark SQL

org.apache.spark.sql.AnalysisException: cannot resolve '`name`' given input columns: [_corrupt_record];

Author: 布丁

Time: 2016-12-13 16:59:21

Scenario: creating a Dataset from JSON data fails.

Exception:

org.apache.spark.sql.AnalysisException: cannot resolve '`name`' given input columns: [_corrupt_record];
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:350)

Test data:

Data a (a.json):

[{"name": "Michael","age": ""},{"name": "Andy","age": "30"},{"name": "Justin","age": "19"}]

Data b (b.json):

[{"name": Michael,"age": null},{"name": Andy,"age": 30},{"name": Justin,"age": 19}]

Test programs:

Program a:

// Entry point: SparkSession
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Spark SQL Example").config("spark.some.config.option", "some-value").getOrCreate()
// Create a DataFrame
val df = spark.read.json("/user/hive/spark2.0.1/testdata/person.json")
df.show()
// Untyped Dataset operations (aka DataFrame operations)
import spark.implicits._
df.printSchema()
df.select("name").show()
df.select($"name", $"age" + 1).show()
df.filter($"age" > 21).show()
df.groupBy("age").count().show()

 

Program b:

case class Person(name: String, age: Long)
val path = "/user/hive/spark2.0.1/testdata/person.json"
// read each file as one whole-text record, then convert to a typed Dataset
val peopleDS = spark.read.json(sc.wholeTextFiles(path).values).as[Person]
peopleDS.show()

Problem description:

Program a handles both a.json and b.json without problems.
Program b throws the exception above when reading a.json.
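One plausible angle, which is an assumption and not stated in the source: Spark 2.x's JSON reader is documented to expect JSON Lines, one self-contained JSON object per line, so a file that is a single top-level array is easy for it to misread, and `_corrupt_record` as the only inferred column means parsing failed outright. A rough way to test that theory is to rewrite the array as one object per line and point program b at the result. `to_json_lines` is a made-up helper, and its sed rewrite only handles flat, single-line arrays like a.json:

```shell
# Sketch only: split a single-line JSON array of flat objects into JSON Lines.
# Relies on GNU sed and on the objects containing no nested "},{" sequences.
to_json_lines() {
  sed -e 's/^\[//' -e 's/\]$//' -e 's/},{/}\n{/g' "$1"
}

# Usage idea: to_json_lines a.json > a.jsonl, then read a.jsonl instead of
# the original array file in program b (or with a plain spark.read.json).
```

If the converted file still fails under `.as[Person]`, the schema itself is the next suspect, since a.json stores age as a quoted string while Person declares it as Long.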