Add a column containing values from 1 to n to a dataframe

I create a dataframe with pyspark like this:

+----+------+
|   k|     v|
+----+------+
|key1|value1|
|key1|value1|
|key1|value1|
|key2|value1|
|key2|value1|
|key2|value1|
+----+------+

"WithColumn" usulidan foydalanib, "rowNum" ustunini qo'shmoqchiman, dataframe natijasi shu tarzda o'zgardi:

+----+------+------+
|   k|     v|rowNum|
+----+------+------+
|key1|value1|     1|
|key1|value1|     2|
|key1|value1|     3|
|key2|value1|     4|
|key2|value1|     5|
|key2|value1|     6|
+----+------+------+

The rowNum column runs from 1 to n, where n is the number of rows. I changed my code as follows:

from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = Window().partitionBy("v").orderBy('k')
my_df= my_df.withColumn("rowNum", F.rowNumber().over(w))

But I get an error message:

'module' object has no attribute 'rowNumber' 

I replaced the rowNumber() method with row_number, and the code above can then run. But when I execute:

my_df.show()

the error message appears again:

Py4JJavaError: An error occurred while calling o898.showString.
: java.lang.UnsupportedOperationException: Cannot evaluate expression: row_number()
    at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:224)
    at org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate.doGenCode(interfaces.scala:342)
    at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
    at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
    at scala.Option.getOrElse(Option.scala:121)
This is probably related to this.
added by author David Arenburg, source

6 Answers

If you require a sequential rowNum value from 1 to n, rather than a monotonically_increasing_id, you can use zipWithIndex()

Recreating your sample data as follows:

rdd = sc.parallelize([('key1','value1'),
                      ('key1','value1'),
                      ('key1','value1'),
                      ('key1','value1'),
                      ('key1','value1'),
                      ('key1','value1')])

You can then use zipWithIndex() to add an index to each row. The map reformats the data and adds 1 to the index, so that it starts at 1.

rdd_indexed = rdd.zipWithIndex().map(lambda x: (x[0][0],x[0][1],x[1]+1))
df = rdd_indexed.toDF(['id','score','rowNum'])
df.show()


+----+------+------+
|  id| score|rowNum|
+----+------+------+
|key1|value1|     1|
|key1|value1|     2|
|key1|value1|     3|
|key1|value1|     4|
|key1|value1|     5|
|key1|value1|     6|
+----+------+------+
1
added
The rdd can be accessed via df.rdd, which lets you use this approach on your dataframe. I'd suggest using monotonically_increasing_id together with withColumn; that approach, however, doesn't guarantee sequential ids (see the sketch after these comments).
added by author Jaco, source
I need to add this new column based on the existing dataframe, so I was hoping to use the withColumn method.
added by author Ivan Lee, source
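
A minimal sketch of the monotonically_increasing_id suggestion from the comments, assuming Spark 2.x and an existing SparkSession named spark (both assumptions, not stated in the thread). Note that the generated ids are increasing and unique but not consecutive, so they will not be exactly 1 to n:

from pyspark.sql import functions as F

# hypothetical sample data mirroring the question
df = spark.createDataFrame(
    [('key1', 'value1'), ('key1', 'value1'), ('key2', 'value1')],
    ['k', 'v'])
# ids are monotonically increasing and unique, but not consecutive
df = df.withColumn("rowNum", F.monotonically_increasing_id())
df.show()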


You can do this with window functions:

from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber  # Spark 1.x only; removed in 2.0 in favor of row_number

w = Window().orderBy()
your_df = your_df.withColumn("rowNum", rowNumber().over(w))

Here your_df is the dataframe to which you need the column added.

1
added
Which version of Spark are you using, @IvanLee and @RakeshKumar?
added by author Jaco, source
I tried your code in my program and found a problem: 'module' object has no attribute 'rowNumber'. So I found another method, row_number, which can run. But when I execute your_df.show(), I get an error message, caused by: java.lang.UnsupportedOperationException: Cannot evaluate expression: row_number()
added by author Ivan Lee, source
I'm fairly sure the cause of this problem is the difference in versions. I'm using Spark 2.1, and my code works correctly with the row_number method.
added by author Ivan Lee, source
I'm using Spark 2.1 and will try to work around this problem. I think it is a version difference.
added by author Ivan Lee, source
@IvanLee has this been resolved?
added by author Rakesh Kumar, source
I'm using Spark 1.6.
added by author Rakesh Kumar, source
And it is 100% correct, as I have used it in production code.
added by author Rakesh Kumar, source
Did you import rowNumber?
added by author Rakesh Kumar, source
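
For reference, a hedged Spark 2.x rewrite of the snippet above, since rowNumber was removed in 2.0: row_number() needs an ordered window, and ordering by monotonically_increasing_id (an assumption, not something this answer specifies) roughly preserves the incoming row order when there is no natural sort key:

from pyspark.sql.window import Window
from pyspark.sql import functions as F

# a window with no partitionBy pulls every row into a single partition,
# which is fine for small data but does not scale
w = Window.orderBy(F.monotonically_increasing_id())
your_df = your_df.withColumn("rowNum", F.row_number().over(w))
your_df.show()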


I'm using Spark 2.2 and found that row_number() works.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

win_row_number = Window.orderBy("col_name")
df_row_number = df.select("col_name", F.row_number().over(win_row_number))
0
added
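
If, as in the question, you want to keep all existing columns rather than a projection, the same expression can presumably be attached with withColumn (a minor variant, not shown in this answer):

# variant of the snippet above: add rowNum to the full dataframe
df_row_number = df.withColumn("rowNum", F.row_number().over(win_row_number))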
