Finding the length of the common prefix of two lines

For every line in a file (about 30000 lines), I want to find the number of leading characters that the current line has in common with the previous line. For example, for the input:

#to
#top
/0linyier
/10000001659/item/1097859586891251/
/10000001659/item/1191085827568626/
/10000121381/item/890759920974460/
/10000154478/item/1118425481552267/
/10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
/1175332/item/10150825241495757/
/806123/item/10210653847881125/
/51927642128/item/488930816844251927642128/341878905879428/

I expect:

0   #to
3   #top
0   /0linyier
1   /10000001659/item/1097859586891251/
19  /10000001659/item/1191085827568626/
6   /10000121381/item/890759920974460/
7   /10000154478/item/1118425481552267/
3   /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2   /1175332/item/10150825241495757/
1   /806123/item/10210653847881125/
1   /51927642128/item/488930816844251927642128/341878905879428/

I am trying this in Perl by unpacking the strings into characters and counting up to the first mismatch, but I wonder whether there is some less slow method using awk or Perl functions.

Update: I have added my own attempt as an answer.

6 answers

Something like this, perhaps?

Written in Perl

use strict;
use warnings 'all';

my $prev = "";

while ( my $line = <DATA> ) {

    chomp $line;

    my $max = 0;
    ++$max until $max > length($line) or substr($prev, 0, $max) ne substr($line, 0, $max);

    printf "%-2d  %s\n", $max-1, $line;

    $prev = $line;
}

__DATA__
#to
#top
/0linyier
/10000001659/item/1097859586891251/
/10000001659/item/1191085827568626/
/10000121381/item/890759920974460/
/10000154478/item/1118425481552267/
/10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
/1175332/item/10150825241495757/
/806123/item/10210653847881125/
/51927642128/item/488930816844251927642128/341878905879428/

Output

0   #to
3   #top
0   /0linyier
1   /10000001659/item/1097859586891251/
19  /10000001659/item/1191085827568626/
6   /10000121381/item/890759920974460/
7   /10000154478/item/1118425481552267/
3   /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2   /1175332/item/10150825241495757/
1   /806123/item/10210653847881125/
1   /51927642128/item/488930816844251927642128/341878905879428/

There is no built-in functionality that will do this for you, but rather than stepping through 1 character at a time you could do a binary search, comparing half of each string at a time, with something like this (half-assed awk pseudo-code):

prev     = curr
lgthPrev = lgthCurr
curr     = $0
lgthCurr = length(curr)
partLgth = (lgthPrev > lgthCurr ? lgthCurr : lgthPrev)
while ( got strings to work with ) {
    partCurr = substr(curr,1,partLgth)
    partPrev = substr(prev,1,partLgth)
    if ( partCurr == partPrev ) {
        # add on half of the rest of each string and try again
        partLgth = partLgth * 1.5
    }
    else {
        # subtract half of these strings and try again
        partLgth = partLgth * 0.5
    }
}

You drop out of the above loop when there are no more substrings to compare, and the result is either:

  1. The 2 substrings matched in the previous iteration, so the length of that iteration's substring is the maximum length of the matching prefix, or
  2. The 2 substrings never matched, so there is no partial match between the 2 strings.

This potentially uses far fewer iterations than a char-by-char comparison, but as written it does a string comparison rather than a character comparison on each iteration, so I don't know what the net performance result would be. You might be able to speed it up by first doing a single-character comparison on each iteration and only doing the string comparison if the characters at the current position match:

prev     = curr
lgthPrev = lgthCurr
curr     = $0
lgthCurr = length(curr)
partLgth = (lgthPrev > lgthCurr ? lgthCurr : lgthPrev)
while ( got strings to work with ) {
    if ( substr(curr,partLgth,1) == substr(prev,partLgth,1) ) {
        isMatch = (substr(curr,1,partLgth) == substr(prev,1,partLgth) ? 1 : 0)
    }
    else {
        isMatch = 0
    }
    if ( isMatch ) {
        # add on half of the rest of each string and try again
        partLgth = partLgth * 1.5
    }
    else {
        # subtract half of these strings and try again
        partLgth = partLgth * 0.5
    }
}
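
The pseudo-code above leaves the loop bounds open. As a rough, runnable sketch of the same idea, the version below keeps an explicit lower and upper bound on the prefix length and halves the gap between them on each step. The names common_prefix_bsearch, cpl, lo, hi and mid are made up for this illustration and do not come from the answer, and no claim is made here that it is faster than a character-by-character loop on real data.

```shell
# common_prefix_bsearch: hypothetical wrapper name for this sketch.
# Prints "<shared prefix length> <line>" for every input line.
common_prefix_bsearch() {
    awk '
    # Length of the common prefix of a and b, found by binary search.
    # Invariant: the first lo characters match, and the answer is <= hi.
    function cpl(a, b,    lo, hi, mid) {
        lo = 0
        hi = (length(a) < length(b) ? length(a) : length(b))
        while (lo < hi) {
            mid = int((lo + hi + 1) / 2)
            # Cheap single-character test first, full prefix test only if it passes.
            if (substr(a, mid, 1) == substr(b, mid, 1) &&
                substr(a, 1, mid) == substr(b, 1, mid))
                lo = mid
            else
                hi = mid - 1
        }
        return lo
    }
    { print cpl(prev, $0), $0; prev = $0 }
    ' "$@"
}

# Demo on the first three sample lines from the question.
printf '%s\n' '#to' '#top' '/0linyier' | common_prefix_bsearch
# 0 #to
# 3 #top
# 0 /0linyier
```

To process a file instead of standard input, pass it as an argument, for example common_prefix_bsearch file.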

In the question the OP said they already have a solution that steps through 1 character at a time, and asked for a faster approach because that one is too slow (I am trying this in Perl by unpacking the strings into characters and counting up to the first mismatch, but I wonder whether there is some less slow method).
added by Ed Morton
You seem to be optimizing something that is probably already fast enough.
added by Borodin
Ah, I see. I hadn't read that because they had already solved it. Maybe you're right. The slow part of working with individual characters is split, which has to create an array and a lot of scalar variables.
added by Borodin

Using awk

awk -v FS="" 'p{
    pl=0; 
    split(p,a,r); 
    for(i=1;i in a; i++)
          if(a[i]==$i){ pl++ }else { break }
}
{ 
   print pl+0,$0; p=$0
}' file

OR

awk -v FS="" 'p{
     pl=0; 
     for(i=1;i<=NF; i++)
     if(substr(p,i,1)==$i){ pl++ }else { break }
}
{ 
   print pl+0,$0; p=$0
}' file

Input

$ cat file
#to
#top
/0linyier
/10000001659/item/1097859586891251/
/10000001659/item/1191085827568626/
/10000121381/item/890759920974460/
/10000154478/item/1118425481552267/
/10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
/1175332/item/10150825241495757/
/806123/item/10210653847881125/
/51927642128/item/488930816844251927642128/341878905879428/

Output

$ awk -v FS="" 'p{pl=0; split(p,a,r); for(i=1;i in a; i++)if(a[i]==$i){ pl++ }else { break }}{ print pl+0,$0; p=$0}' file
0 #to
3 #top
0 /0linyier
1 /10000001659/item/1097859586891251/
19 /10000001659/item/1191085827568626/
6 /10000121381/item/890759920974460/
7 /10000154478/item/1118425481552267/
3 /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2 /1175332/item/10150825241495757/
1 /806123/item/10210653847881125/
1 /51927642128/item/488930816844251927642128/341878905879428/

Explanation

awk -v FS="" '                                  # call awk set field sep=""
       p{
           pl=0;                                # reset variable pl
           split(p,a,r);                        # split variable p
           for(i=1;i in a; i++)                 # loop through array
                 if(a[i]==$i){                  # check array element with current field
                     pl++                       # if matched then increment pl
                 }else { 
                     break                      # else its over break loop
                 }
        }
        { 
            print pl+0,$0;                      # print count, and current record
            p=$0                                # store current record in variable p
        }
     ' file

Note that the standard says the results are unspecified if FS is assigned the empty string. Some versions of awk will produce the output shown above for your sample. The version of awk on OS/X prints a warning along with its output:

awk: field separator FS is empty

So the special meaning of setting FS to the empty string does not work in every awk.
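
One way to avoid depending on an empty FS is to index into both lines with substr and length, which POSIX does define. The following is a minimal sketch under that assumption. The wrapper name common_prefix_portable and the variables prev, max and n are made up for this illustration.

```shell
# common_prefix_portable: hypothetical wrapper name for this sketch.
# Uses only substr() and length(), so FS is left at its default.
common_prefix_portable() {
    awk '
    {
        # Only compare up to the length of the shorter of the two lines.
        max = (length(prev) < length($0) ? length(prev) : length($0))
        n = 0
        while (n < max && substr(prev, n + 1, 1) == substr($0, n + 1, 1))
            n++
        print n, $0
        prev = $0
    }' "$@"
}

# Demo on the first three sample lines from the question.
printf '%s\n' '#to' '#top' '/0linyier' | common_prefix_portable
# 0 #to
# 3 #top
# 0 /0linyier
```

Because it never touches $i, this version also gives the same counts when a line happens to contain spaces.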

Right, the behavior when FS is set to the null string is undefined by POSIX, so any awk can do whatever it likes with it and still be POSIX compliant. GNU awk (and others?) chooses to split the record into characters, which is a useful feature.
added by Ed Morton
@EdMorton: yes, true, thanks for the useful information
added by Akshay Hegde

You can do this directly with gawk. This simply compares the current line with the previous one and counts the number of common leading characters:

BEGIN{
    prev="";
}
{
    curr=$1;
    n = length(curr);
    m = length(prev);
    s = n<m?n:m;
    cnt = 0;
    for(i=1;i<= s;i++){
        if(substr(curr, i, 1) == substr(prev, i, 1)){
            cnt++;
        }else{
            break;
        }
    }
    print(cnt, curr);

    prev=curr;
}

In awk:

$ awk -F '' '{n=split(p,a,"");for(i=1;i<=(NF

Explained:

awk -F '' '{                                # each char on its own field
    n=split(p,a,"")                         # split prev record p each char in own a cell
    for(i=1;i<=(NF

OK, so mine turned out to be similar to AkshayHegde's solutions (++ for great taste, though not without some differences), so I dare to leave it here. The comments on his answer about FS apply to this solution as well.
added by James Brown

Perl script:

#!/usr/bin/perl -ln
$c = [ unpack "C*" ]; #current record
$i = 0;
$i++ while $i < @$c && defined $p->[$i] && $p->[$i] == $c->[$i]; # count till mismatch or end of either line
print "$i $_";
$p = $c               #save current record for next time

The same thing without the command-line flags:

#!/usr/bin/perl
while (<>) {
    chomp;
    $c = [ unpack "C*" ];
    $i = 0;
    $i++ while $i < @$c && defined $p->[$i] && $p->[$i] == $c->[$i];
    print "$i $_\n";
    $p = $c
}

The same thing as a one-liner:

perl -lne '$c=[unpack "C*"]; $i=0; $i++ while $i < @$c && defined $p->[$i] && $p->[$i] == $c->[$i]; print "$i $_"; $p = $c'

Pass the file(s) containing the lines as argument(s), or pipe the data into the command.

On my real data, this runs about as fast as Borodin's solution:

$ xzcat href.xz |wc -l
33150
$ time xzcat href.xz | ./borodin.pl >borodin.out

real    0m2.437s
user    0m2.684s
sys     0m0.080s
$ time xzcat href.xz | ./pk.pl > pk.out 

real    0m2.305s
user    0m2.564s
sys     0m0.088s
$ diff pk.out borodin.out 