The Way of Mastering Spark (Advanced): Spark from Beginner to Expert, Section 10: Spark SQL Case Study in Practice (Part 1)

Date: 2023-09-14 09:00:24

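This article uses the git log of the Spark project on GitHub as sample data to walk through Spark SQL in detail.
The data is obtained with the following command: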


[root@master spark]# git log --pretty=format:'{"commit":"%H","author":"%an","author_email":"%ae","date":"%ad","message":"%f"}' > sparktest.json

The formatted log output looks like this:


[root@master spark]# head -1 sparktest.json

{"commit":"30b706b7b36482921ec04145a0121ca147984fa8","author":"Josh Rosen","author_email":"joshrosen@databricks.com","date":"Fri Nov 6 18:17:34 2015 -0800","message":"SPARK-11389-CORE-Add-support-for-off-heap-memory-to-MemoryManager"}

Then upload sparktest.json to HDFS and load it in spark-shell:


scala> val df = sqlContext.read.json("/data/sparktest.json")

16/02/05 09:59:56 INFO json.JSONRelation: Listing hdfs://ns1/data/sparktest.json on driver

Inspect the loaded data. The first two rows look like this:


+----------------+--------------------+--------------------+--------------------+--------------------+
|          author|        author_email|              commit|                date|             message|
+----------------+--------------------+--------------------+--------------------+--------------------+
|      Josh Rosen|joshrosen@databri...|30b706b7b36482921...|Fri Nov 6 18:17:3...|SPARK-11389-CORE-...|
|Michael Armbrust|michael@databrick...|105732dcc6b651b97...|Fri Nov 6 17:22:3...|HOTFIX-Fix-python...|
+----------------+--------------------+--------------------+--------------------+--------------------+
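
As a minimal sketch of this inspection step, assuming the `df` loaded above is still in scope in spark-shell, the schema and a couple of rows can be printed with standard DataFrame methods. Because every value in the JSON records is a quoted string, Spark should infer all five columns as strings.

```scala
// Assumes `df` is the DataFrame returned by sqlContext.read.json above.
// Print the inferred schema as a tree.
df.printSchema()

// List each column with its inferred type as (name, type) pairs.
df.dtypes.foreach { case (name, tpe) => println(s"$name: $tpe") }

// Show the first two rows; long values are truncated for display.
df.show(2)
```

If an unexpected extra column appears here, it is worth checking the JSON file for malformed lines before running further queries.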

(1) Register the DataFrame as a temporary table named commitlog, then show the first two rows


scala> df.registerTempTable("commitlog")

scala> sqlContext.sql("SELECT * FROM commitlog").show(2)
+----------------+--------------------+--------------------+--------------------+--------------------+
|          author|        author_email|              commit|                date|             message|
+----------------+--------------------+--------------------+--------------------+--------------------+
|      Josh Rosen|joshrosen@databri...|30b706b7b36482921...|Fri Nov 6 18:17:3...|SPARK-11389-CORE-...|
|Michael Armbrust|michael@databrick...|105732dcc6b651b97...|Fri Nov 6 17:22:3...|HOTFIX-Fix-python...|
+----------------+--------------------+--------------------+--------------------+--------------------+

(2) Count the total number of commits


scala> sqlContext.sql("SELECT count(*) as TotalCommitNumber FROM commitlog").show
+-----------------+
|TotalCommitNumber|
+-----------------+
|            13507|
+-----------------+
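
The same total can be obtained without SQL. The following is a small sketch for spark-shell on Spark 1.x, assuming the `df` loaded earlier; the names `totalCommits` and `viaSql` are only illustrative.

```scala
// Assumes `df` and `sqlContext` from the spark-shell session above.
// Registering again is harmless: it replaces any existing "commitlog" table.
df.registerTempTable("commitlog")

// DataFrame method: count() is an action that returns the row count as a Long.
val totalCommits: Long = df.count()
println(s"TotalCommitNumber = $totalCommits")

// SQL form: the result is a one-row DataFrame, so read its single value.
val viaSql: Long = sqlContext
  .sql("SELECT count(*) AS TotalCommitNumber FROM commitlog")
  .first()
  .getLong(0)
```

Both values should agree, since they count the rows of the same DataFrame.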

(3) Sort authors by number of commits in descending order


scala> sqlContext.sql("SELECT author,count(*) as CountNumber FROM commitlog GROUP BY author ORDER BY CountNumber DESC").show
+--------------------+-----------+
|              author|CountNumber|
+--------------------+-----------+
|       Matei Zaharia|       1590|
|         Reynold Xin|       1071|
|     Patrick Wendell|        857|
|       Tathagata Das|        416|
|          Josh Rosen|        348|
|  Mosharaf Chowdhury|        290|
|           Andrew Or|        287|
|       Xiangrui Meng|        285|
|          Davies Liu|        281|
|          Ankur Dave|        265|
|          Cheng Lian|        251|
|    Michael Armbrust|        243|
|             zsxwing|        200|
|           Sean Owen|        197|
|     Prashant Sharma|        186|
|  Joseph E. Gonzalez|        185|
|            Yin Huai|        177|
|Shivaram Venkatar...|        173|
|      Aaron Davidson|        164|
|      Marcelo Vanzin|        142|
+--------------------+-----------+
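
For comparison, here is a sketch of the same ranking written with DataFrame methods instead of SQL. It assumes the `df` loaded earlier; `commitsPerAuthor` is an illustrative name, and the rename only mirrors the `CountNumber` alias used in the SQL query.

```scala
import org.apache.spark.sql.functions.desc

// Assumes `df` from the spark-shell session above.
// groupBy(...).count() adds a column named "count" holding the rows per group.
val commitsPerAuthor = df
  .groupBy("author")
  .count()
  .withColumnRenamed("count", "CountNumber")
  .orderBy(desc("CountNumber"))

// show() prints only the first 20 rows by default.
commitsPerAuthor.show()
```

The default limit of 20 rows in `show()` is why the table above stops after twenty authors; passing a number, for example `show(50)`, prints more.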

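You can try more complex queries on your own. What is shown here is only the difference in usage between DataFrame methods and SQL statements on a temporary table, to give you an overall picture.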
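
As one possible starting point for a more complex query, the sketch below counts commits per email domain. It is not from the original walkthrough: it assumes the `df` loaded earlier and Spark 1.5 or later (where `regexp_extract` is available), and the `domain` column and `commitsPerDomain` name are hypothetical.

```scala
import org.apache.spark.sql.functions.{col, desc, regexp_extract}

// Assumes `df` from the spark-shell session above.
// Derive a hypothetical "domain" column: everything after '@' in author_email.
val commitsPerDomain = df
  .withColumn("domain", regexp_extract(col("author_email"), "@(.+)$", 1))
  .groupBy("domain")
  .count()
  .orderBy(desc("count"))

// Print the ten domains with the most commits.
commitsPerDomain.show(10)
```

Addresses without an `@` yield an empty string for `domain`, so they end up grouped together in a single row.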

