Topic: ACTCD19: 2-Gram Token Model of C/C++
Author: Andrew Tomazos <andrewtomazos@gmail.com>
Date: Fri, 22 Mar 2019 10:42:57 +1000
I'm releasing an artifact from ACTCD19 today: a 2-gram token model of
C/C++.
In short, I took all ~1 billion lines of C/C++ from the package archive of a
popular Linux distribution (Debian Sid), tokenized them, found the 2^16 most
common tokens, and then looked at consecutive pairs of those common
tokens.
For each unique consecutive pair of tokens I counted the number of
occurrences of the pair, ranked the pairs by that count, and printed the
list to a text file.
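The steps above can be sketched roughly as follows. This is not the actual ACTCD19 code: the regex tokenizer is a deliberately crude stand-in for a real C/C++ lexer, the names `tokenize`, `bigram_counts` and `format_ranked` are made up for illustration, and it assumes that a pair is counted only when both tokens are adjacent in the original stream and both are in the common-token vocabulary.

```python
from collections import Counter
import re

# Crude stand-in for a C/C++ lexer (assumption): identifiers, hex and
# decimal integers, a few multi-character operators, and otherwise any
# single non-space character. Comments, string literals, etc. are ignored.
TOKEN_RE = re.compile(
    r"[A-Za-z_]\w*|0[xX][0-9A-Fa-f]+|\d+|->|::|\+\+|--|==|!=|<=|>=|&&|\|\||\S"
)

def tokenize(source):
    """Split one source text into a list of token strings."""
    return TOKEN_RE.findall(source)

def bigram_counts(sources, vocab_size=2**16):
    """Count consecutive pairs of tokens drawn from the most common tokens."""
    token_lists = [tokenize(s) for s in sources]

    # Pass 1: find the vocab_size most common tokens across all sources.
    unigrams = Counter(t for toks in token_lists for t in toks)
    vocab = {t for t, _ in unigrams.most_common(vocab_size)}

    # Pass 2: count adjacent pairs where both tokens are in the vocabulary.
    # Pairs are not counted across source boundaries (assumption).
    pairs = Counter()
    for toks in token_lists:
        for a, b in zip(toks, toks[1:]):
            if a in vocab and b in vocab:
                pairs[(a, b)] += 1
    return pairs

def format_ranked(pairs):
    """Render pairs as 'COUNT TOKEN1 TOKEN2' lines, most frequent first."""
    return [f"{n} {a} {b}" for (a, b), n in pairs.most_common()]
```

On a corpus of this size the two passes would have to stream over files rather than hold every token list in memory; the in-memory version is only meant to show the shape of the computation.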
The full list is here: https://github.com/tomazos/actcd19 (23 MB)
As a sample, here are the top 100:
OCCURRENCES TOKEN1 TOKEN2
164945669 ) ;
67929216 ; }
65467334 ) {
53726432 if (
51458279 ( )
51424109 ) )
43881403 , 0
41003732 0 ,
30636098 ; if
24879109 ) ,
22700384 , {
22126128 } ,
18481475 # define
18209227 0 ;
18084000 , -
16609718 ( (
16418272 } }
16280145 0x00 ,
16157455 , 0x00
15373250 = 0
15177226 # include
15131545 0 )
12608580 ] =
12443840 ; return
11826846 1 ,
11702558 , &
11003066 ( !
10700212 * )
10334144 char *
9977210 = (
9785911 - 1
9164656 { if
9155554 , 0x00000000
9107322 1 )
8712605 ( &
8712604 , 1
8551953 0x00000000 ,
8259921 break ;
8194044 60 ,
8145750 , 60
8091734 } else
7982929 ] ;
7851333 ; int
7782848 ( const
7715721 , const
7706712 ) (
7449936 # endif
7221090 for (
7113503 { return
7076402 1 ;
7060939 std ::
6988426 } if
6926199 ( *
6922312 ) return
6869464 , (
6808548 [ i
6741089 ; #
6707967 } ;
6480366 ; case
6409883 0 ]
6391962 } static
6359919 i ]
6309078 [ 0
6247162 , NULL
6226628 ] )
6127050 , int
6086446 ] .
5899953 } void
5893746 ) #
5885488 ; i
5827183 NULL )
5599382 ) .
5587185 ; break
5556216 ; static
5319908 1 ]
5186289 else {
4971658 ) ->
4959459 ; for
4938293 2 ,
4923439 ( int
4907318 ( void
4881240 NULL ;
4805595 const char
4773166 sizeof (
4757621 ) ==
4750206 ; void
4684770 ) const
4643800 , 2
4605787 void *
4576513 NULL ,
4559738 ( struct
4548680 , 0x0000
4509706 0x0000 ,
4494806 ] ,
4491637 > (
4369493 ++ )
4179616 ( 0
3917628 ( i
3914781 [ 1
3846912 i =
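One way such a list could be used as a 2-gram model is to turn the counts into estimates of which token follows a given token. The sketch below is only an illustration: `parse_ranked` and `next_token_probs` are hypothetical names, and it assumes each row is a count followed by two whitespace-free tokens, as in the sample above. The estimates are relative to the rows that are loaded, so with a truncated list they are not true probabilities over the whole corpus.

```python
def parse_ranked(lines):
    """Parse 'COUNT TOKEN1 TOKEN2' rows into {(token1, token2): count}."""
    counts = {}
    for line in lines:
        parts = line.split()
        # Skip the header row and anything that is not a 3-field count row.
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        counts[(parts[1], parts[2])] = int(parts[0])
    return counts

def next_token_probs(counts, first):
    """Relative frequency of each token seen after `first` in the loaded rows."""
    following = {b: n for (a, b), n in counts.items() if a == first}
    total = sum(following.values())
    if total == 0:
        return {}
    return {b: n / total for b, n in following.items()}

# Usage with three rows taken from the sample above.
sample = [
    "OCCURRENCES TOKEN1 TOKEN2",
    "164945669 ) ;",
    "65467334 ) {",
    "51424109 ) )",
]
probs = next_token_probs(parse_ranked(sample), ")")
most_likely = max(probs, key=probs.get)  # the token most often seen after ")"
```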
--
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
To view this discussion on the web visit https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/CAB%2B4KHKiR95WSrAwggSToUQ%3Dke%3Dk50EdDvDKR%2B%2BRAM0DZ92%2Bng%40mail.gmail.com.